Sales Demand Forecast in E-commerce using a Long Short-Term Memory Neural Network Methodology
Abstract
Generating accurate and reliable sales forecasts is crucial in the E-commerce business. The current state-of-the-art techniques are typically univariate methods, which produce forecasts considering only the historical sales data of a single product. However, in a situation where large quantities of related time series are available, conditioning the forecast of an individual time series on past behaviour of similar, related time series can be beneficial. Given that the product assortment hierarchy in an E-commerce platform contains large numbers of related products, in which the sales demand patterns can be correlated, our attempt is to incorporate this cross-series information in a unified model. We achieve this by globally training a Long Short-Term Memory network (LSTM) that exploits the non-linear demand relationships available in an E-commerce product assortment hierarchy. Aside from the forecasting engine, we propose a systematic preprocessing framework to overcome the challenges in an E-commerce setting. We also introduce several product grouping strategies to supplement the LSTM learning schemes, in situations where sales patterns in a product portfolio are disparate. We empirically evaluate the proposed forecasting framework on a real-world online marketplace dataset from Walmart.com. Our method achieves competitive results on category level and super-departmental level datasets, outperforming state-of-the-art techniques.
I Introduction
Generating product-level operational demand forecasts is a crucial factor in E-commerce platforms. Accurate and reliable demand forecasts enable better inventory planning, competitive pricing, timely promotion planning, etc. While accurate forecasts can lead to huge savings and cost reductions, poor demand estimations are proven to be costly in this space.
The business environment in E-commerce is highly dynamic and often volatile, which is largely caused by holiday effects, low product-sales conversion rates, competitor behaviour, etc. As a result, demand data in this space carry various challenges, such as highly non-stationary historical data, irregular sales patterns, sparse sales data, highly intermittent sales, etc. Furthermore, product assortments in these platforms follow a hierarchical structure, where certain products within a subgroup of the hierarchy can be similar or related to each other. In essence, this hierarchical structure provides a natural grouping of the product portfolio, where items that fall in the same subcategory/category/department/super-department are considered as a single group, in which the sales patterns can be correlated.
The time series of such related products are correlated and may share key properties of demand. For example, increasing demand for an item may decrease or increase the sales demand of another item, i.e., substitute or complementary products. Therefore, accounting for the notion of similarity between these products becomes necessary to produce accurate and meaningful forecasts in the E-commerce domain. Fig. 1 shows an example of such related time series.
The existing demand forecasting methods in the E-commerce domain are largely influenced by state-of-the-art forecasting techniques from the exponential smoothing [1] and the ARIMA [2] families. However, these forecasting methods are univariate, thus treat each time series separately, and forecast them in isolation. As a result, though many related products are available, in which the sales demand patterns can be correlated, these univariate models ignore such potential cross-series information available within related products.
Consequently, efforts to tap the enormous potential of such multiple related time series are becoming increasingly popular [13, 14, 15, 16, 17, 18]. More recently, Recurrent Neural Networks (RNN) and Long Short-Term Memory networks (LSTM), a special group of neural networks (NN) that are naturally suited for time series forecasting, have achieved promising results by globally training the network across all related time series, which enables the network to exploit any cross-series information available [15, 16, 18].
In this study, we adapt the framework proposed in Bandara et al. [18] to a real-world demand forecasting problem for the E-commerce business, and extend the original contributions of [18] in the following ways.

We exploit sales correlations available in an E-commerce product hierarchy. This is accompanied by a systematic preprocessing unit that addresses data challenges in the E-commerce domain.

We analyze and compare two different LSTM learning schemes with different backpropagation error terms, and include a mix of static and dynamic features to incorporate potential external driving factors of sales demand.

Our framework is empirically evaluated using real-world retail sales data from Walmart.com, in which we use state-of-the-art forecasting techniques to compare against our proposed framework.
The rest of the paper is organized as follows. In Section II we formally define the problem of generating a global time series model for product demand forecasting. In Section III we discuss the state of the art in this space. We describe the proposed preprocessing scheme in Section IV. Next, in Section V, we outline the key learning properties included in our LSTM network architecture. We summarise the overall architecture of our forecasting engine in Section VI. Our experimental setup is presented in Section VII, where we demonstrate the results obtained by applying our framework to a large dataset from Walmart.com. Finally, Section VIII concludes the paper.
II Problem Statement
Let $i \in \{1, 2, \ldots, n\}$ index the $i$th product from the $n$ total products in our database. The previous sales demand values of product $i$ are given by $X_i = \{x_1, x_2, \ldots, x_K\} \in \mathbb{R}^K$, where $K$ represents the length of the time series. Additionally, we introduce an exogenous feature space, $Z_i = \{z_1, z_2, \ldots, z_K\} \in \mathbb{R}^{K \times P}$, where $P$ denotes the feature dimension of $Z_i$.
Our aim is to develop a prediction model $F$, which uses the past sales data of all the products in the database, i.e., $X = \{X_1, X_2, \ldots, X_n\}$, and the exogenous feature set $Z_i$ to forecast the next $M$ future sales demand points of product $i$, i.e., $\hat{X}_i = \{\hat{x}_{K+1}, \hat{x}_{K+2}, \ldots, \hat{x}_{K+M}\}$, where $M$ is the forecasting horizon. The model can be defined as follows:

$$\hat{X}_i = F(X, Z_i, \theta) \qquad (1)$$

Here, $\theta$ are the model parameters, which are learned in the LSTM training process.
III Prior Work
The traditional demand forecast algorithms are largely influenced by state-of-the-art univariate statistical forecasting methods such as exponential smoothing methods [1] and ARIMA models [2]. As described earlier, forecasting in the E-commerce space commonly needs to address challenges such as irregular sales trends, the presence of highly bursty and sparse sales data, etc. Nonetheless, numerous studies have been undertaken to alleviate the limitations of classical approaches in these challenging conditions. This includes introducing preprocessing techniques [3], feature engineering methods [4, 5, 6, 7], and modified likelihood functions [8, 9].
As emphasized in Section I, one major limitation of univariate forecasting techniques is that they are incapable of using cross-series information for forecasting. Moreover, many studies based on NNs, which are recognised as a strong alternative to traditional approaches, have employed NNs only as univariate forecasting techniques [10, 11, 12].
In addition to improving the forecasting accuracy, forecasting models that build on multiple related time series can positively contribute towards handling outliers in a time series. This is because incorporating the common behaviour of multiple time series may reduce the effect of an abnormal observation in a single time series.
Recently, methods that build global models across such time series databases have achieved promising results. Trapero et al. [13] introduce pooling regression models on sets of related time series. They improve the promotional forecast accuracy in situations where historical sales data are limited for a single time series. Chapados [17] achieves good results in the supply chain planning domain by modelling multiple time series with a Bayesian framework, where that author uses the available hierarchical structure to disseminate cross-series information across a set of time series. More recently, deep learning techniques such as RNNs and CNNs have also been shown to be competitive in this space [14, 15, 16, 18].
The probabilistic forecasting framework introduced by [15, 16] attempts to address the uncertainty factor in forecasting. Those authors use RNN and LSTM architectures to learn from groups of time series, and provide quantile estimations of the forecast distributions. Moreover, Bandara et al. [18] develop a clustering-based forecasting framework to accommodate situations where groups of heterogeneous time series are available. Here, those authors initially group the time series into subgroups based on a similarity measure, before using RNNs to learn across each subgroup of time series. Furthermore, [14] apply CNNs to model similar sets of financial time series together, where they highlight that the global learning procedure improves both the robustness and forecasting accuracy of a model, and also enables the network to effectively learn from shorter time series, where the information available within an individual time series is limited.
IV Data Preprocessing
Sales datasets in the E-commerce environment experience various issues that we aim to address with the following preprocessing mechanisms in our framework.
IV-A Fixing Data Quality Issues
Nowadays, many organisations use Extract, Transform, Load (ETL) as the main data integration methodology in data warehousing pipelines. However, the ETL process is often unstable in real-time processing, and may cause false “zero” sales in the dataset. Therefore, we distinguish the actual zero sales from the false zero sales (“fake zeros”) and treat the latter as missing observations.
Our approach is mostly heuristic: we initially compute the minimum non-zero sales of each item over the past 6 months. We then treat the zero sales of a certain item as “fake” zeros if that item's minimum non-zero sales are higher than a threshold, and regard these zero sales as missing observations. It is worth noting that the ground truth of zero sales is not available, thus potential false positives can appear in the dataset.
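As an illustration, the heuristic above can be sketched in plain Python; the function name and the threshold value are our own assumptions, not the production implementation:

```python
import math

def mark_fake_zeros(sales, threshold=5):
    """Treat zeros as missing (NaN) when the item's minimum non-zero
    sales are above `threshold`, i.e., a zero sale is implausible for
    an otherwise fast-moving item (hypothetical threshold)."""
    nonzero = [s for s in sales if s > 0]
    if nonzero and min(nonzero) > threshold:
        # Zeros look like ETL artifacts ("fake zeros") -> mark missing.
        return [float("nan") if s == 0 else s for s in sales]
    return list(sales)

cleaned = mark_fake_zeros([12, 9, 0, 14, 0, 11])   # fast-moving item
kept = mark_fake_zeros([1, 0, 2, 0, 1])            # slow-moving item
```

Items whose minimum non-zero sales fall below the threshold keep their zeros, since genuine zero demand is plausible there.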
IV-B Handling Missing Values
We use a forward-filling strategy to impute missing sales observations in the dataset. This approach uses the most recent valid observation available to replace the missing values. We performed preliminary experiments that showed that this approach outperforms more sophisticated imputation techniques such as linear regression and Classification And Regression Trees (CART).
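A minimal forward-filling sketch (pure Python; pandas' `ffill` would be the idiomatic equivalent), with `None` standing in for a missing observation and the leading-gap fallback being our own assumption:

```python
def forward_fill(series, default=0.0):
    """Replace missing values (None) with the most recent valid
    observation; leading gaps fall back to `default`."""
    filled, last = [], default
    for value in series:
        if value is None:
            filled.append(last)
        else:
            filled.append(value)
            last = value
    return filled

filled = forward_fill([None, 3, None, None, 7])
```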
IV-C Product Grouping
According to [18], employing a time series grouping strategy can improve the LSTM performance in situations where time series are disparate. Therefore, we introduce two product grouping mechanisms in our preprocessing scheme.
In the first approach, the target products are grouped based on available domain knowledge. Here, we use the sales ranking and the percentage of zero sales as the primary business metrics to form groups of products. The first group (G1) represents the products with a high sales ranking and a low zero-sales density, group 2 (G2) represents the products with a low sales ranking and a high zero-sales density, and group 3 (G3) represents the rest of the products. These conditions are summarised in Table II. From an E-commerce perspective, products in G1 bring the highest contribution to the business, thus improving the sales forecast accuracy in G1 is most important.
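A sketch of the grouping rule, reading Sales.quantile as a sales-ranking quantile (lower = better-selling); the cut-off values and their directions are our illustrative reading of Table II, not confirmed business rules:

```python
def assign_group(sales_rank_q, zero_sales_q):
    """G1: top-ranked sellers with low zero-sales density;
    G2: bottom-ranked sellers with high zero-sales density;
    G3: everything else (illustrative cut-offs)."""
    if sales_rank_q < 0.33 and zero_sales_q < 0.67:
        return "G1"
    if sales_rank_q > 0.67 and zero_sales_q > 0.33:
        return "G2"
    return "G3"

groups = [assign_group(s, z)
          for s, z in [(0.1, 0.2), (0.9, 0.8), (0.5, 0.5)]]
```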
The second approach is based on time series clustering, where we perform K-means clustering on a set of time series features to identify the product grouping. Table I provides an overview of these features, where the first two represent business-specific features, while the rest represent time-series-specific features. The time-series-specific features are extracted using the tsfeatures package developed by [36]. Finally, we use a silhouette analysis to determine the optimal number of clusters in the K-means setting.
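The clustering step can be sketched with a toy k-means over per-series feature vectors (numpy only); the real framework clusters the Table I features and picks the number of clusters via silhouette analysis, which we fix here for brevity:

```python
import numpy as np

def kmeans(features, k, iters=50, seed=0):
    """Toy k-means: returns a cluster label per series."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Assign each series to its nearest cluster centre.
        dists = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = np.argmin(dists, axis=1)
        # Recompute centres as the mean of their members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return labels

# Hypothetical feature rows, e.g. [strength of trend, zero-sales %].
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
labels = kmeans(X, k=2)
```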
IV-D Sales Normalization
The product assortment hierarchy is composed of numerous commodities that follow various sales volume ranges, thus a data normalisation strategy becomes necessary before building a global model like ours. We use the mean-scale transformation proposed by [15], where the mean sales of a product are considered as the scaling factor. This can be formally defined as follows:
$$\bar{X}_i = \frac{X_i}{\frac{1}{K}\sum_{t=1}^{K} x_t} \qquad (2)$$

Here, $\bar{X}_i$ represents the normalised sales vector, and $K$ denotes the number of sales observations.
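In code, the mean-scale transformation of Eq. (2) is simply a division by the item's own mean sales (a sketch; an all-zero series would need a guard on the scale):

```python
def mean_scale(sales):
    """Normalise a sales vector by its mean (the scaling factor)."""
    scale = sum(sales) / len(sales)
    return [x / scale for x in sales], scale

normalised, scale = mean_scale([10.0, 20.0, 30.0])
```

The scale is kept so that forecasts can later be back-transformed to the original sales range.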
TABLE I: Overview of the time series features
Feature | Description
Sales.quantile | Sales quantile over total sales
Zero.sales.percentage | Sales sparsity/percentage of zero sales
Trend | Strength of trend
Spikiness | Strength of spikiness
Linearity | Strength of linearity
Curvature | Strength of curvature
ACF1e | Autocorrelation coefficient at lag 1 of the residuals
ACF1x | Autocorrelation coefficient at lag 1
Entropy | Spectral entropy
TABLE II: Product grouping conditions
GroupID | Sales ranking | Sales sparsity
1 | Sales.quantile < 0.33 | Zero.sales.percentage.quantile < 0.67
2 | Sales.quantile > 0.67 | Zero.sales.percentage.quantile > 0.33
3 | other | other
IV-E Moving Window Approach
The Moving Window (MW) strategy transforms a time series $X_i$ into pairs of (input, output) patches, which are later used as the training data of the LSTM.
Given a time series $X_i = \{x_1, \ldots, x_K\}$ of length $K$, the MW strategy converts $X_i$ into $(K - n - m)$ patches, where each patch has a size of $(n + m)$. Here, $n$ and $m$ represent the sizes of the input window and output window, respectively. In our study, we make the size of the output window $m$ identical to the intended forecasting horizon $M$, following the Multi-Input Multi-Output (MIMO) strategy in multi-step forecasting. This enables our model to directly predict all future values up to the intended forecasting horizon $M$. The MIMO strategy is advocated by many studies [23, 16] for multi-step forecasting with NNs. Fig. 2 illustrates an example of applying the MW approach to a sales demand time series from our dataset.
We use the first $(K - m)$ data points of time series $X_i$ to train the LSTM, and reserve the last output window of $X_i$ for the network validation.
Also, to avoid possible network saturation effects, which are caused by the bounds of the network activation functions [28], we employ a local normalisation process at each MW step. In this step, the mean value of each input window is calculated and subtracted from each data point of the corresponding input and output windows. Thereafter, these windows are shifted forward by one step and the normalisation process is repeated. The normalisation procedure also enables the network to generate conservative forecasts (for details see Bandara et al. [18]), which is beneficial in forecasting in general, and in particular in the E-commerce setting, as it reduces the risk of generating large demand forecasting errors.
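The MW transform with its local normalisation can be sketched as follows; `n` and `m` are the input and output window sizes, and the per-window mean is kept for the later de-normalisation step:

```python
def moving_windows(series, n, m):
    """Slide over `series`, yielding locally normalised (input, output)
    patches plus the input-window mean used for normalisation."""
    patches = []
    for start in range(len(series) - n - m + 1):
        x = series[start:start + n]
        y = series[start + n:start + n + m]
        mean = sum(x) / n                       # local normalisation term
        patches.append(([v - mean for v in x],
                        [v - mean for v in y],
                        mean))
    return patches

patches = moving_windows([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], n=3, m=2)
```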
V LSTM Network Architecture
LSTMs are an extension of RNNs that have the ability to learn long-term dependencies in a sequence, overcoming the limitations of vanilla RNNs [21]. The cohesive gating mechanism, i.e., input, output, and forget gates, together with the self-contained memory cell, i.e., the “Constant Error Carousel” (CEC), allows the LSTM to regulate the information flow across the network. This enables the LSTM to propagate the network error for much longer sequences, while capturing their long-term temporal dependencies.
In this study, we use a special variant of LSTMs, known as “LSTM with peephole connections”, which requires the LSTM input and forget gates to incorporate the previous state of the LSTM memory cell. For further discussion of RNN and LSTM architectures, we refer to [18]. In the following, we describe how exactly the LSTM architecture is used in our work.
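For concreteness, one step of a peephole LSTM cell can be written in plain numpy as below; the weights are random placeholders, and this is a sketch of the cell equations rather than the TensorFlow implementation used in the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_lstm_step(x, h, c, W, U, P, b):
    """One LSTM step; the input/forget gates peek at the previous cell
    state c, and the output gate peeks at the updated cell state."""
    i = sigmoid(W["i"] @ x + U["i"] @ h + P["i"] * c + b["i"])   # input gate
    f = sigmoid(W["f"] @ x + U["f"] @ h + P["f"] * c + b["f"])   # forget gate
    g = np.tanh(W["g"] @ x + U["g"] @ h + b["g"])                # candidate
    c_new = f * c + i * g                                        # CEC update
    o = sigmoid(W["o"] @ x + U["o"] @ h + P["o"] * c_new + b["o"])
    return o * np.tanh(c_new), c_new                             # (h_new, c_new)

rng = np.random.default_rng(1)
d_in, d_cell = 4, 3                                  # toy dimensions
W = {k: rng.normal(size=(d_cell, d_in)) for k in "ifgo"}
U = {k: rng.normal(size=(d_cell, d_cell)) for k in "ifgo"}
P = {k: rng.normal(size=d_cell) for k in "ifo"}      # peephole vectors
b = {k: np.zeros(d_cell) for k in "ifgo"}
h, c = peephole_lstm_step(rng.normal(size=d_in),
                          np.zeros(d_cell), np.zeros(d_cell), W, U, P, b)
```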
V-A Learning Schemes
As mentioned earlier, we use the input and output data frames generated from the MW procedure as the primary training source of the LSTM. Therefore, the LSTM is provided with an array of lagged values as the input data, instead of being fed a single observation at a time. This essentially relaxes the LSTM memory regulation and allows the network to learn directly from a lagged time series [18].
Fig. 3 summarizes the LSTM learning schemes used in our study, LSTM-LS1 and LSTM-LS2. Here, $X_t$ represents the input window at time step $t$, $h_t$ represents the hidden state at time step $t$, and the cell state at time step $t$ is represented by $C_t$. Note that $p$ denotes the dimension of the memory cell of the LSTM. Additionally, we introduce $\hat{Y}_t$ to represent the projected output of the LSTM at time step $t$. Here, $m$ denotes our output window size, which is equivalent to the forecasting horizon $M$.
Here, each LSTM layer is followed by a fully connected neural layer (excluding the bias component) to project each LSTM cell output to the dimension of the output window $m$.
The proposed learning schemes can be distinguished by the overall error term $E$ used in the network backpropagation, which is backpropagation through time (BPTT) [20]. Given the actual observations $Y_t$ of the output window at time step $t$, which are used as the teacher inputs for the predictions $\hat{Y}_t$, the LSTM-LS1 scheme accumulates the error $e_t$ of each LSTM cell instance to compute the error $E$ of the network. Here, $e_t$ refers to the prediction error at time step $t$. In LSTM-LS2, in contrast, only the error term of the final LSTM cell instance is used as the error for the network training. For example, in Fig. 3, the $E$ of the LSTM-LS1 scheme is equivalent to $\sum_{t} e_t$, whereas the error term at the final LSTM cell instance gives the error $E$ of LSTM-LS2. These error terms are eventually used to update the network parameters, i.e., the LSTM weight matrices.
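The difference between the two schemes reduces to which per-step errors enter the training loss; a toy numeric sketch (squared error and the numbers are made up for illustration):

```python
import numpy as np

def loss_ls1(y_true, y_pred):
    """LSTM-LS1: accumulate the error of every unrolled cell instance."""
    return sum(float(np.sum((t - p) ** 2)) for t, p in zip(y_true, y_pred))

def loss_ls2(y_true, y_pred):
    """LSTM-LS2: only the final cell instance's error is used."""
    return float(np.sum((y_true[-1] - y_pred[-1]) ** 2))

# Two unrolled steps, each with an output window of size 2.
y_true = [np.array([1.0, 2.0]), np.array([2.0, 3.0])]
y_pred = [np.array([1.0, 1.0]), np.array([2.0, 2.0])]
e1, e2 = loss_ls1(y_true, y_pred), loss_ls2(y_true, y_pred)
```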
In this study, we use TensorFlow, an open-source deep-learning toolkit [29], to implement the above LSTM learning schemes.
V-B Exogenous Variables
We use a combination of static and dynamic features to model external factors that affect the sales demand. In general, static features contain time-invariant information, such as product class, product category, etc. Dynamic features include the available calendar-related information (e.g., holidays, season, weekday/weekend). These features can be useful in capturing the demand behaviours of products in a certain period of time.
Fig. 4 demonstrates an example of applying the MW approach (see Section IV-E) to include static and dynamic features in an input window. The input window is now a unified vector of past sales observations, static features, and dynamic features. As a result, in addition to the past sales observations, we use the input window of the holidays, the input window of the seasons, the input window of the days of the week, and the input window of the sub-category types. The LSTM then uses a concatenation of these input windows to learn the actual observation of the output window.
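Assembling one such input window can be sketched as below; the feature layout (a weekday one-hot, a holiday flag, and a one-hot sub-category) is illustrative, not the exact production schema:

```python
def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def build_input_window(sales, weekday_ids, holiday_flags, subcat_id,
                       n_weekdays=7, n_subcats=3):
    """Concatenate past sales with dynamic (calendar) and static
    (sub-category) features at every step of the input window."""
    window = []
    for t, sale in enumerate(sales):
        window.append([sale]
                      + one_hot(weekday_ids[t], n_weekdays)   # dynamic
                      + [float(holiday_flags[t])]             # dynamic
                      + one_hot(subcat_id, n_subcats))        # static, repeated
    return window

w = build_input_window([3.0, 5.0], weekday_ids=[0, 1],
                       holiday_flags=[0, 1], subcat_id=2)
```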
VI Overall Procedure
The proposed forecasting framework is composed of three processing phases, namely 1) a preprocessing layer, 2) an LSTM training layer, and 3) a post-processing layer. Fig. 5 gives a schematic overview of our proposed forecasting framework.
As described in Section IV, we initially conduct a series of preprocessing steps to arrange the raw data for the LSTM training procedure. Afterwards, the LSTM models are trained according to the LSTM-LS1 and LSTM-LS2 learning schemes shown in Fig. 3. Then, in order to obtain the final forecasts, we rescale and denormalize the predictions produced by the LSTM. Here, the rescaling process back-transforms the generated forecasts to their original scale of sales, whereas the denormalization process (see Section IV-E) adds back the mean sales of the last input window to the forecasts.
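The post-processing step then inverts the two normalisations in reverse order; a sketch with hypothetical numbers:

```python
def postprocess(forecast, window_mean, scale):
    """De-normalise (add back the last input window's mean), then
    rescale (multiply by the item's mean sales)."""
    return [(f + window_mean) * scale for f in forecast]

# Item with mean sales 20; last normalised input window had mean 0.1.
final = postprocess([0.2, 0.4], window_mean=0.1, scale=20.0)
```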
VII Experiments
In this section, we describe the experimental setup used to empirically evaluate our proposed forecasting framework. This includes the datasets, error metrics, hyperparameter selection method, benchmark methods and LSTM variants used to perform the experiments, and the results obtained.
VII-A Datasets
We evaluate our forecasting framework on datasets collected from Walmart.com. Initially, we evaluate our framework on a subset of 1724 items that belong to the household product category, which consists of 15 different subcategories. Next, we scale up the number of products to 18254 by extracting a collection from a single super-department, which consists of 16 different categories.
We use 190 consecutive days of sales data in 2018. The last 10 days of data are reserved for model testing. We define our forecasting horizon $M$ as 10, i.e., the training output window size is equivalent to 10. Following the heuristic proposed by [18], we choose the size of the training input window as 13 (10 × 1.25 ≈ 13).
TABLE III: Hyperparameter bounds
Model Parameter | Minimum value | Maximum value
LSTM cell dimension | 50 | 100
Mini-batch size | 60 | 1500
Learning rate per sample | – | –
Maximum epochs | 5 | 20
Gaussian noise injection | – | –
L2 regularization weight | – | –
VII-B Error Measure
We use the mean absolute percentage error (MAPE) as our forecasting error metric. We define the MAPE as:

$$\mathrm{MAPE} = \frac{1}{m}\sum_{t=1}^{m}\frac{|Y_t - F_t|}{Y_t} \qquad (3)$$

Here, $Y_t$ represents the actual sales demand at time $t$, and $F_t$ is the respective sales forecast generated by a prediction model. The number $m$ denotes the amount of sales data points in the test set, whose length is equivalent to the intended forecasting horizon. Furthermore, to avoid problems with zero values, we add a constant term to the denominator of (3).
In addition to the mean of the MAPEs (Mean MAPE), we also report the median of the MAPEs (Median MAPE), which is suitable for summarising the error distribution in situations where the majority of the observations are zero sales, i.e., long-tailed sales demand items.
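The error computation can be sketched as below; the exact value of the stabilising constant in the denominator is not stated here, so we assume 1 for illustration:

```python
def mape(actuals, forecasts, eps=1.0):
    """MAPE of Eq. (3), with `eps` added to the denominator to guard
    against zero actual sales (eps=1 is an assumption)."""
    return sum(abs(a - f) / (a + eps)
               for a, f in zip(actuals, forecasts)) / len(actuals)

def summarise(per_item_mapes):
    """Return (Mean MAPE, Median MAPE) across items."""
    s = sorted(per_item_mapes)
    mid = len(s) // 2
    median = s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2
    return sum(s) / len(s), median

m = mape([0, 10], [0, 5])   # the zero actual is handled by eps
```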
TABLE IV: Results for the category level dataset
 | | MAPE (All), k = 1724 | MAPE (G1), k = 549 | MAPE (G2), k = 544 | MAPE (G3), k = 631
Model | Configuration | Mean | Median | Mean | Median | Mean | Median | Mean | Median
LSTM.ALL | LSTM-LS1/Bayesian/Adam | 0.888 | 0.328 | 1.872 | 0.692 | 0.110 | 0.073 | 0.640 | 0.283
LSTM.ALL | LSTM-LS1/Bayesian/COCOB | 0.803 | 0.267 | 1.762 | 0.791 | 0.070 | 0.002 | 0.537 | 0.259
LSTM.ALL | LSTM-LS2/Bayesian/Adam | 0.847 | 0.327 | 1.819 | 0.738 | 0.103 | 0.047 | 0.582 | 0.326
LSTM.GROUP | LSTM-LS1/Bayesian/Adam | 0.873 | 0.302 | 1.882 | 0.667 | 0.093 | 0.016 | 0.604 | 0.283
LSTM.GROUP | LSTM-LS1/Bayesian/COCOB | 1.039 | 0.272 | 2.455 | 0.818 | 0.074 | 0.000 | 0.549 | 0.250
LSTM.GROUP | LSTM-LS2/Bayesian/Adam | 0.812 | 0.317 | 1.818 | 0.738 | 0.091 | 0.022 | 0.587 | 0.314
LSTM.FEATURE | LSTM-LS1/Bayesian/Adam | 1.065 | 0.372 | 2.274 | 0.889 | 0.135 | 0.100 | 0.738 | 0.388
LSTM.FEATURE | LSTM-LS1/Bayesian/COCOB | 0.800 | 0.267 | 1.758 | 0.772 | 0.069 | 0.000 | 0.533 | 0.255
LSTM.FEATURE | LSTM-LS2/Bayesian/Adam | 0.879 | 0.324 | 1.886 | 0.750 | 0.091 | 0.022 | 0.611 | 0.324
LSTM.CLUSTER | LSTM-LS1/Bayesian/Adam | 0.954 | 0.313 | 2.109 | 0.869 | 0.135 | 0.110 | 0.625 | 0.322
LSTM.CLUSTER | LSTM-LS1/Bayesian/COCOB | 0.793 | 0.308 | 1.695 | 0.748 | 0.077 | 0.005 | 0.562 | 0.302
LSTM.CLUSTER | LSTM-LS2/Bayesian/Adam | 1.001 | 0.336 | 2.202 | 0.863 | 0.084 | 0.017 | 0.664 | 0.347
EWMA | – | 0.968 | 0.342 | 1.983 | 1.026 | 0.107 | 0.021 | 0.762 | 0.412
ARIMA | – | 1.153 | 0.677 | 2.322 | 0.898 | 0.103 | 0.056 | 0.730 | 0.496
ETS (non-seasonal) | – | 0.965 | 0.362 | 2.020 | 0.803 | 0.113 | 0.060 | 0.713 | 0.444
ETS (seasonal) | – | 0.983 | 0.363 | 2.070 | 0.804 | 0.116 | 0.059 | 0.713 | 0.445
Naïve | – | 0.867 | 0.250 | 1.803 | 0.795 | 0.124 | 0.000 | 0.632 | 0.250
Naïve Seasonal | – | 0.811 | 0.347 | 1.789 | 0.679 | 0.086 | 0.000 | 0.523 | 0.320
TABLE V: Results for the super-department level dataset
 | | MAPE (All items), k = 18254 | MAPE (G1), k = 5682 | MAPE (G2), k = 5737 | MAPE (G3), k = 6835
Model | Configuration | Mean | Median | Mean | Median | Mean | Median | Mean | Median
LSTM.ALL | LSTM-LS1/Bayesian/Adam | 1.006 | 0.483 | 2.146 | 1.285 | 0.191 | 0.079 | 0.668 | 0.434
LSTM.ALL | LSTM-LS1/Bayesian/COCOB | 0.944 | 0.442 | 2.041 | 1.203 | 0.163 | 0.053 | 0.614 | 0.394
LSTM.GROUP | LSTM-LS1/Bayesian/Adam | 0.871 | 0.445 | 1.818 | 1.009 | 0.189 | 0.067 | 0.603 | 0.377
LSTM.GROUP | LSTM-LS1/Bayesian/COCOB | 0.921 | 0.455 | 1.960 | 1.199 | 0.173 | 0.053 | 0.618 | 0.394
LSTM.FEATURE | LSTM-LS1/Bayesian/Adam | 0.979 | 0.424 | 2.117 | 1.279 | 0.151 | 0.050 | 0.653 | 0.377
LSTM.FEATURE | LSTM-LS1/Bayesian/COCOB | 1.000 | 0.443 | 2.143 | 1.282 | 0.215 | 0.092 | 0.676 | 0.398
EWMA | – | 1.146 | 0.579 | 2.492 | 1.650 | 0.229 | 0.091 | 0.805 | 0.562
ARIMA | – | 1.084 | 0.536 | 2.305 | 1.497 | 0.198 | 0.094 | 0.734 | 0.510
ETS (non-seasonal) | – | 1.097 | 0.527 | 2.314 | 1.494 | 0.204 | 0.092 | 0.755 | 0.509
ETS (seasonal) | – | 1.089 | 0.528 | 2.290 | 1.483 | 0.204 | 0.092 | 0.756 | 0.510
Naïve | – | 0.981 | 0.363 | 2.008 | 1.122 | 0.204 | 0.000 | 0.713 | 0.286
Naïve Seasonal | – | 1.122 | 0.522 | 2.323 | 1.513 | 0.219 | 0.050 | 0.803 | 0.475
VII-C Hyperparameter Selection & Optimization
Our LSTM-based learning framework contains various hyperparameters, including the LSTM cell dimension, model learning rate, number of epochs, mini-batch size, and model regularization terms, i.e., Gaussian noise and L2-regularization weights. We use two implementations of a Bayesian global optimization methodology, bayesian-optimization and SMAC [30], to autonomously determine the optimal set of hyperparameters in our model [32]. Table III summarises the bounds of the hyperparameter values used throughout the LSTM learning process, represented by the respective minimum and maximum columns.
VII-D Benchmarks and LSTM Variants
We use a host of different univariate forecasting techniques to benchmark against our proposed forecasting framework. This includes forecasting methods from the exponential smoothing family, i.e., the exponentially weighted moving average (EWMA) and exponential smoothing (ETS) [35], as well as autoregressive integrated moving average (ARIMA) models [35]. Furthermore, we include the standard benchmarks in forecasting, Naïve and Naïve Seasonal. Some of these benchmarks are currently used in the forecasting framework at Walmart.com.
Furthermore, in our experiments, we add the following variants of our baseline LSTM model.

LSTM.ALL: The baseline LSTM model, where one model is globally trained across all the available time series.

LSTM.GROUP: A separate LSTM model is built on each subgroup of time series, which are identified by the domain knowledge available.

LSTM.FEATURE: The subgroup labels identified in the LSTM.GROUP approach are used as an external feature (one-hot encoded vector) of the LSTM.

LSTM.CLUSTER: The time series subgrouping is performed using a time series feature based clustering approach (see Section IV-C). Similar to LSTM.GROUP, a separate LSTM model is trained on each cluster.
VII-E Results & Discussion
Table IV and Table V show the results for the category level and super-department level datasets. Here, $k$ corresponds to the number of items in each group. We use a weekly seasonality in the seasonal benchmarks, i.e., ETS (seasonal) and Naïve Seasonal. It is also worth noting that for the super-department dataset, we employ only the LSTM.GROUP and LSTM.FEATURE grouping strategies, and include only the best-performing learning scheme on the category level dataset, which is LSTM-LS1, to examine the robustness of our forecasting framework.
In the tables, under each LSTM variant, we present the results of the different learning schemes, i.e., LSTM-LS1 and LSTM-LS2, hyperparameter selection methods, i.e., Bayesian and SMAC, and optimization learning algorithms, i.e., Adam and COCOB, which achieve comparable results.
According to Table IV, considering all the items in the category, the proposed LSTM.CLUSTER variant obtains the best Mean MAPE, while the Naïve forecast gives the best Median MAPE. Meanwhile, regarding G1, the items with the most business impact, the LSTM.CLUSTER and LSTM.GROUP variants outperform the rest of the benchmarks in terms of the Mean MAPE and Median MAPE, respectively. We also observe in G1 that the results of the LSTM.ALL variant are improved after applying our grouping strategies. Furthermore, on average, the LSTM variants together with the Naïve forecast achieve the best-performing results within G2 and G3, where the product sales are relatively sparse compared to G1.
We observe a similar pattern of results in Table V, where, holistically, the LSTM.GROUP variant gives the best Mean MAPE, while the Naïve forecast ranks first in Median MAPE. Likewise, in G1, the LSTM.GROUP variant performs best amongst the benchmarks, and in particular outperforms the LSTM.ALL variant, upholding the benefits of item grouping strategies under these circumstances. Similarly, on average, the LSTM variants and the Naïve forecast obtain the best results in G2 and G3.
Overall, the majority of the LSTM variants display competitive results under both evaluation settings, showing the robustness of our forecasting framework with large amounts of items. More importantly, these results reflect the contribution made by the time series grouping strategies to uplift the baseline LSTM performance.
VIII Conclusions
There exists great potential to improve sales forecasting accuracy in the E-commerce domain. One good opportunity is to utilize the correlated and similar sales patterns available in a product portfolio. In this paper, we have introduced a novel demand forecasting framework based on LSTMs that exploits non-linear relationships that exist in E-commerce business data.
We have used the proposed approach to forecast the sales demand by training a global model across the items available in a product assortment hierarchy. We have also introduced several systematic grouping strategies for our base model, which are in particular useful in situations where product sales are sparse.
Our methodology has been evaluated on a real-world E-commerce database from Walmart.com. To demonstrate the robustness of our framework, we have assessed our propositions on category level and super-department level datasets. The results have shown that our methods outperform the state-of-the-art univariate forecasting techniques.
Furthermore, the results indicate that E-commerce product hierarchies contain various cross-product demand patterns and correlations, and that approaches to exploit this information are necessary to uplift the sales forecasting accuracy in this space.