Sales Demand Forecast in E-commerce using a Long Short-Term Memory Neural Network Methodology

Kasun Bandara¹, Peibei Shi², Christoph Bergmeir¹, Hansika Hewamalage¹, Quoc Tran², Brian Seaman²
¹Faculty of Information Technology, Monash University, Melbourne, Australia.
herath.bandara@monash.edu, christoph.bergmeir@monash.edu, hansika.hewamalage@monash.edu
²Smart Pricing, @Walmart Labs, San Bruno, USA
pshi@walmartlabs.com, qtran@walmartlabs.com, brian@walmartlabs.com
Abstract

Generating accurate and reliable sales forecasts is crucial in the E-commerce business. The current state-of-the-art techniques are typically univariate methods, which produce forecasts considering only the historical sales data of a single product. However, in a situation where large quantities of related time series are available, conditioning the forecast of an individual time series on past behaviour of similar, related time series can be beneficial. Given that the product assortment hierarchy in an E-commerce platform contains large numbers of related products, in which the sales demand patterns can be correlated, our attempt is to incorporate this cross-series information in a unified model. We achieve this by globally training a Long Short-Term Memory network (LSTM) that exploits the non-linear demand relationships available in an E-commerce product assortment hierarchy. Aside from the forecasting engine, we propose a systematic pre-processing framework to overcome the challenges in an E-commerce setting. We also introduce several product grouping strategies to supplement the LSTM learning schemes, in situations where sales patterns in a product portfolio are disparate. We empirically evaluate the proposed forecasting framework on a real-world online marketplace dataset from Walmart.com. Our method achieves competitive results on category level and super-departmental level datasets, outperforming state-of-the-art techniques.

Keywords: E-Commerce, Demand Forecasting, LSTM

I Introduction

Generating product-level operational demand forecasts is a crucial factor in E-commerce platforms. Accurate and reliable demand forecasts enable better inventory planning, competitive pricing, timely promotion planning, etc. While accurate forecasts can lead to huge savings and cost reductions, poor demand estimations have proven to be costly in this space.

The business environment in E-commerce is highly dynamic and often volatile, which is largely caused by holiday effects, low product-sales conversion rate, competitor behaviour, etc. As a result, demand data in this space carry various challenges, such as highly non-stationary historical data, irregular sales patterns, sparse sales data, highly intermittent sales, etc. Furthermore, product assortments in these platforms follow a hierarchical structure, where certain products within a subgroup of the hierarchy can be similar or related to each other. In essence, this hierarchical structure provides a natural grouping of the product portfolio, where items that fall in the same subcategory/category/department/super-department are considered as a single group, in which the sales patterns can be correlated.

The time series of such related products are correlated and may share key properties of demand. For example, an increase in the demand of one item may cause a decrease or increase in the sales demand of another item, i.e., substitute or complementary products. Therefore, accounting for the notion of similarity between these products becomes necessary to produce accurate and meaningful forecasts in the E-commerce domain. Fig. 1 shows an example of such related time series.

Fig. 1: Daily sales demand of four different products over a four months period, extracted from Walmart.com. These products are collected from the same product assortment sub-hierarchy.

The existing demand forecasting methods in the E-commerce domain are largely influenced by state-of-the-art forecasting techniques from the exponential smoothing [1] and ARIMA [2] families. However, these forecasting methods are univariate, thus treating each time series separately and forecasting it in isolation. As a result, even though many related products with potentially correlated sales demand patterns are available, these univariate models ignore such cross-series information.

Consequently, efforts to tap the enormous potential of such multiple related time series are becoming increasingly popular [13, 14, 15, 16, 17, 18]. More recently, Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs), a special group of neural networks (NNs) that are naturally suited for time series forecasting, have achieved promising results by training the network globally across all related time series, which enables it to exploit any available cross-series information [15, 16, 18].

In this study, we adapt the framework proposed in Bandara et al. [18] to a real-world demand forecasting problem for E-commerce business, and extend the original contributions of [18] in the following ways.

  • We exploit sales correlations available in an E-commerce product hierarchy. This accompanies a systematic preprocessing unit that addresses data challenges in the E-commerce domain.

  • We analyze and compare two different LSTM learning schemes with different back-propagation error terms, and include a mix of static and dynamic features to incorporate potential external driving factors of sales demand.

  • Our framework is empirically evaluated using real-world retail sales data from Walmart.com, in which we use state-of-the-art forecasting techniques to compare against our proposed framework.

The rest of the paper is organized as follows. In Section II we formally define the problem of generating a global time series model for product demand forecasting. In Section III we discuss the state of the art in this space. We describe the proposed preprocessing scheme in Section IV. Next, in Section V, we outline the key learning properties included in our LSTM network architecture. We summarise the overall architecture of our forecasting engine in Section VI. Our experimental setup is presented in Section VII, where we demonstrate the results obtained by applying our framework to a large dataset from Walmart.com. Finally, Section VIII concludes the paper.

II Problem Statement

Let $p_i$ be the $i$th product from the $n$ total products in our database. The previous sales demand values of product $p_i$ are given by the time series $X_i = \{x_1, x_2, \ldots, x_K\} \in \mathbb{R}^{K}$, where $K$ represents the length of the time series. Additionally, we introduce an exogenous feature space $Z_i \in \mathbb{R}^{K \times P}$, where $P$ denotes the feature dimension of $Z_i$.

Our aim is to develop a prediction model $F$, which uses the past sales data of all the products in the database, i.e., $X = \{X_1, \ldots, X_n\}$, and the exogenous feature set $Z_i$ to forecast the next $M$ future sales demand points of product $p_i$, i.e., $\{x_{K+1}, \ldots, x_{K+M}\}$, where $M$ is the forecasting horizon. The model $F$ can be defined as follows:

$$\{x_{K+1}, \ldots, x_{K+M}\} = F(X, Z_i, \theta) \qquad (1)$$

Here, $\theta$ are the model parameters, which are learned in the LSTM training process.

III Prior Work

The traditional demand forecast algorithms are largely influenced by state-of-the-art univariate statistical forecasting methods such as exponential smoothing methods [1] and ARIMA models [2]. As described earlier, forecasting in the E-commerce space commonly needs to address challenges such as irregular sales trends, presence of highly bursty and sparse sales data, etc. Nonetheless, numerous studies have been undertaken to alleviate the limitations of classical approaches in these challenging conditions. This includes introducing preprocessing techniques [3], feature engineering methods [4, 5, 6, 7], and modified likelihood functions [8, 9].

As emphasized in Section I, one major limitation of univariate forecasting techniques is that they are incapable of using cross-series information for forecasting. Also, many studies based on NNs, which are recognised as a strong alternative to traditional approaches, have employed them only as univariate forecasting techniques [10, 11, 12].

In addition to improving forecasting accuracy, forecasting models built on multiple related time series can positively contribute towards handling outliers in a time series. This is because incorporating the common behaviour of multiple time series may reduce the effect of an abnormal observation in a single time series.

Recently, methods to build global models across such time series databases have achieved promising results. Trapero et al. [13] introduce pooling regression models on sets of related time series. They improve the promotional forecast accuracy in situations where historical sales data are limited for a single time series. Chapados [17] achieves good results in the supply chain planning domain by modelling multiple time series using a Bayesian framework, where the available hierarchical structure is used to disseminate cross-series information across a set of time series. More recently, deep learning techniques, such as RNNs and CNNs, have also been shown to be competitive in this space [14, 15, 16, 18].

The probabilistic forecasting framework introduced by [15, 16] attempts to address the uncertainty factor in forecasting. Those authors use RNN and LSTM architectures to learn from groups of time series, and provide quantile estimations of the forecast distributions. Moreover, Bandara et al. [18] develop a clustering-based forecasting framework to accommodate situations where groups of heterogeneous time series are available. Here, those authors initially group the time series into subgroups based on a similarity measure, before using RNNs to learn across each subgroup of time series. Furthermore, [14] apply CNNs to model similar sets of financial time series together, where they highlight that the global learning procedure improves both robustness and forecasting accuracy of a model, and also enables the network to effectively learn from shorter time series, where information available within an individual time series is limited.

IV Data Preprocessing

Sales datasets in the E-commerce environment experience various issues that we aim to address with the following preprocessing mechanisms in our framework.

IV-A Fixing Data Quality Issues

Nowadays, many organisations use Extract, Transform, Load (ETL) as the main data integration methodology in data warehousing pipelines. However, the ETL process is often unstable in real-time processing, and may cause false “zero” sales in the dataset. Therefore, we distinguish the actual zero sales from the false zero sales (“fake zeros”) and treat the latter as missing observations.

Our approach is mostly heuristic, where we initially compute the minimum non-zero sales of each item over the past 6 months. Then, we treat the zero sales of a certain item as "fake" zero sales if its minimum non-zero sales over that period are higher than a predefined threshold, and we treat these zero sales as missing observations. It is also noteworthy that the ground truth for zero sales is not available, thus potential false positives can remain in the dataset.
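As an illustration, this heuristic could be implemented along the following lines with pandas; the column names, the six-month lookback, and the threshold value are placeholders, since the paper does not report the exact threshold used.

```python
import numpy as np
import pandas as pd

def flag_fake_zeros(sales: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Mark zero-sales days as missing when an item's recent non-zero sales
    suggest the zeros stem from the ETL pipeline rather than real demand.

    `sales` is assumed to have columns: item_id, date (datetime), units.
    """
    sales = sales.sort_values(["item_id", "date"]).copy()
    six_months_ago = sales["date"].max() - pd.DateOffset(months=6)
    recent = sales[sales["date"] >= six_months_ago]

    # Minimum non-zero daily sales per item over the past six months.
    min_nonzero = (
        recent.loc[recent["units"] > 0]
              .groupby("item_id")["units"].min()
              .rename("min_nonzero").reset_index()
    )
    sales = sales.merge(min_nonzero, on="item_id", how="left")

    # Zeros of items whose minimum non-zero sales exceed the threshold are
    # treated as "fake" zeros, i.e., missing observations.
    fake = (sales["units"] == 0) & (sales["min_nonzero"] > threshold)
    sales.loc[fake, "units"] = np.nan
    return sales.drop(columns="min_nonzero")
```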

IV-B Handling Missing Values

We use a forward-filling strategy to impute missing sales observations in the dataset. This approach uses the most recent valid observation available to replace the missing values. We performed preliminary experiments that showed that this approach outperforms more sophisticated imputation techniques such as linear regression and Classification And Regression Trees (CART).
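Continuing the hypothetical sales frame from the previous sketch, the per-item forward fill is a one-liner in pandas:

```python
# Forward-fill missing (fake-zero) observations per item with the most recent
# valid sales value; leading gaps with no earlier observation stay missing.
sales = sales.sort_values(["item_id", "date"])
sales["units"] = sales.groupby("item_id")["units"].ffill()
```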

IV-C Product Grouping

According to [18], employing a time series grouping strategy can improve the LSTM performance in situations where time series are disparate. Therefore, we introduce two product grouping mechanisms in our preprocessing scheme.

In the first approach, the target products are grouped based on available domain knowledge. Here, we use the sales ranking and the percentage of zero sales as primary business metrics to form groups of products. The first group (G1) represents the product group with a high sales ranking and a low zero-sales density, group 2 (G2) represents the product group with a low sales ranking and a high zero-sales density, and group 3 (G3) represents the rest of the products. These conditions are summarised in Table II. From an E-commerce perspective, we recognise that products in G1 bring the highest contribution to the business, thus improving the sales forecast accuracy in G1 is most important.

The second approach is based on time series clustering, where we perform K-means clustering on a set of time series features to identify the product grouping. Table I provides an overview of these features, where the first two features represent business-specific features, while the rest represent time series specific features. The time series specific features are extracted using the tsfeatures package developed by [36]. Finally, we use a silhouette analysis to determine the optimal number of clusters in the K-means setting.
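A minimal sketch of this clustering step, assuming the Table I features have already been computed into a per-product matrix (e.g., exported from the tsfeatures R package); the function name and cluster-count range are our own illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def cluster_products(feature_matrix: np.ndarray, k_min: int = 2, k_max: int = 10):
    """K-means clustering on the per-product feature matrix (one row per product,
    columns as in Table I), choosing the cluster count by silhouette analysis."""
    X = StandardScaler().fit_transform(feature_matrix)
    best = (None, -1.0, None)                      # (k, silhouette, labels)
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best[1]:
            best = (k, score, labels)
    return best
```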

IV-D Sales Normalization

The product assortment hierarchy is composed of numerous commodities that follow various sales volume ranges, so a data normalisation strategy becomes necessary before building a global model like ours. We use the mean-scale transformation proposed by [15], where the mean sales of a product are used as the scaling factor. This can be formally defined as follows:

$$X_{i}^{norm} = \frac{X_{i}}{\frac{1}{K}\sum_{t=1}^{K} x_{i,t}} \qquad (2)$$

Here, $X_{i}^{norm}$ represents the normalised sales vector of product $p_i$, and $K$ denotes the number of sales observations of product $p_i$.
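A minimal numpy sketch of this transformation; the function name is ours, and the guard for all-zero series is a defensive addition not described in the paper.

```python
import numpy as np

def mean_scale(series: np.ndarray):
    """Normalise a product's sales vector by its mean sales (Eq. 2).
    Returns the scaled series and the scaling factor, which is needed later
    to rescale the forecasts back to the original sales volume."""
    scale = series.mean()
    if scale == 0:                  # guard for all-zero series (not in the paper)
        scale = 1.0
    return series / scale, scale
```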

Feature Description
Sales.quantile Sales quantile over total sales
Zero.sales.percentage Sales sparsity/percentage of zero sales
Trend Strength of trend
Spikiness Strength of spikiness
Linearity Strength of linearity
Curvature Strength of curvature
ACF1-e Autocorrelation coefficient at lag 1 of the residuals
ACF1-x Autocorrelation coefficient at lag 1
Entropy Spectral entropy
TABLE I: Product Clustering Features
Group-ID Sales ranking Sales sparsity
1 Sales.quantile 0.33 Zero.sales.percentage.quantile 0.67
2 Sales.quantile 0.67 Zero.sales.percentage.quantile 0.33
3 other other
TABLE II: Product Grouping Thresholds

IV-E Moving Window Approach

The Moving Window (MW) strategy transforms a time series $X_i$ into pairs of (input, output) patches, which are later used as the training data of the LSTM.

Given a time series $X_i = \{x_1, x_2, \ldots, x_K\}$ of length $K$, the MW strategy converts $X_i$ into $K-(n+m)+1$ patches, where each patch has a size of $n+m$. Here, $n$ and $m$ represent the sizes of the input window and the output window, respectively. In our study, we make the size of the output window $m$ identical to the intended forecasting horizon $M$, following the Multi-Input Multi-Output (MIMO) strategy in multi-step forecasting. This enables our model to directly predict all future values up to the intended forecasting horizon $M$. The MIMO strategy is advocated by many studies [23, 16] for multi-step forecasting with NNs. Fig. 2 illustrates an example of applying the MW approach to a sales demand time series from our dataset.

Fig. 2: Applying the MW approach to time series $X_i$. Here, $W_t$ refers to the input window, and $Y_t$ is the corresponding output window.

We use the first $K-m$ data points of time series $X_i$ to train the LSTM, and reserve the last output window of $X_i$ for the network validation.

Also, to avoid possible network saturation effects, which are caused by the bounds of the network activation functions [28], we employ a local normalisation process at each MW step. In this step, the mean value of each input window $W_t$ is calculated and subtracted from each data point of the corresponding input and output window. Thereafter, these windows are shifted forward by one step, i.e., $W_t \rightarrow W_{t+1}$ and $Y_t \rightarrow Y_{t+1}$, and the normalisation process is repeated. The normalisation procedure also enables the network to generate conservative forecasts (for details see Bandara et al. [18]), which is beneficial in forecasting in general, and in particular in the E-commerce setting, as it reduces the risk of generating large demand forecasting errors.
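The window generation and local normalisation described above can be sketched as follows; this is an illustrative reading of the procedure, not the authors' code. The returned window means are kept for the later denormalisation step (Section VI).

```python
import numpy as np

def moving_windows(series: np.ndarray, n: int, m: int):
    """Slice a (mean-scaled) sales series into MIMO training patches.

    Each patch pairs an input window of length n with the following m points
    as the output window; both are centred by subtracting the mean of the
    input window (the local normalisation step described above)."""
    inputs, outputs, means = [], [], []
    for start in range(len(series) - n - m + 1):
        w = series[start:start + n]              # input window W_t
        y = series[start + n:start + n + m]      # output window Y_t (horizon m)
        mu = w.mean()                            # local normalisation factor
        inputs.append(w - mu)
        outputs.append(y - mu)
        means.append(mu)
    return np.array(inputs), np.array(outputs), np.array(means)
```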

V LSTM Network Architecture

LSTMs are an extension of RNNs that have the ability to learn long-term dependencies in a sequence, overcoming the limitations of vanilla RNNs [21]. The cohesive gating mechanism, i.e., input, output, and forget gates, together with the self-contained memory cell, i.e., “Constant Error Carousel” (CEC) allow the LSTM to regulate the information flow across the network. This enables the LSTM to propagate the network error for much longer sequences, while capturing their long-term temporal dependencies.

In this study, we use a special variant of LSTMs, known as "LSTM with peephole connections", in which the LSTM input and forget gates additionally incorporate the previous state of the LSTM memory cell. For further discussions of RNN and LSTM architectures, we refer to [18]. In the following, we describe how exactly the LSTM architecture is used in our work.

V-A Learning Schemes

As mentioned earlier, we use the input and output data frames generated from the MW procedure as the primary training source of LSTM. Therefore, the LSTM is provided with an array of lagged values as the input data, instead of feeding in a single observation at a time. This essentially relaxes the LSTM memory regulation and allows the network to learn directly from a lagged time series [18].

Fig. 3 summarizes the LSTM learning schemes used in our study, LSTM-LS1 and LSTM-LS2. Here, $W_t$ represents the input window at time step $t$, $h_t \in \mathbb{R}^{d}$ represents the hidden state at time step $t$, and the cell state at time step $t$ is represented by $C_t$. Note that $d$ denotes the dimension of the memory cell of the LSTM. Additionally, we introduce $\hat{Y}_t \in \mathbb{R}^{m}$ to represent the projected output of the LSTM at time step $t$. Here, $m$ denotes our output window size, which is equivalent to the forecasting horizon $M$.

Each LSTM layer is followed by a fully connected neural layer (excluding the bias component) to project each LSTM cell output $h_t$ to the dimension of the output window $m$.

The proposed learning schemes can be distinguished by the overall error term $E$ used in the network back-propagation, which is back-propagation through time (BPTT; [20]). Given the actual observations $Y_t$ of the values in the output window at time step $t$, which are used as the teacher inputs for the predictions $\hat{Y}_t$, the LSTM-LS1 scheme accumulates the error of each LSTM cell instance to compute the error of the network. Here, $e_t$ refers to the prediction error at time step $t$, where $e_t = Y_t - \hat{Y}_t$. In LSTM-LS2, only the error term of the final LSTM cell instance is used as the error for the network training. For example, in Fig. 3, the error $E$ of the LSTM-LS1 scheme is equivalent to $\sum_{t} e_t$, whereas the error term $e_t$ of the final LSTM cell instance gives the error $E$ of LSTM-LS2. These error terms are eventually used to update the network parameters, i.e., the LSTM weight matrices.

(a) An unrolled representation of learning scheme LSTM-LS1

(b) An unrolled representation of learning scheme LSTM-LS2
Fig. 3: The architectures of the LSTM learning schemes, LSTM-LS1 and LSTM-LS2. Each squared unit represents a peephole-connected LSTM cell, where $h_t$ provides short-term memory and $C_t$ retains the long-term dependencies of the LSTM.

In this study, we use TensorFlow, an open-source deep-learning toolkit [29] to implement the above LSTM learning schemes.
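As an illustration of the two error terms, the following TensorFlow/Keras sketch builds a network whose loss either covers every time step (LSTM-LS1) or only the final one (LSTM-LS2). It is a simplified reading of Fig. 3, not the authors' implementation; in particular, a standard LSTM layer stands in for the peephole-connected cells.

```python
import tensorflow as tf

def build_lstm(n_timesteps: int, input_dim: int, cell_dim: int, m: int,
               scheme: str = "LS1") -> tf.keras.Model:
    """Sketch of the two learning schemes.
    LS1 expects targets of shape (batch, n_timesteps, m); LS2 expects targets
    of shape (batch, 1, m), i.e., only the final output window."""
    inputs = tf.keras.Input(shape=(n_timesteps, input_dim))
    states = tf.keras.layers.LSTM(cell_dim, return_sequences=True)(inputs)
    # Affine projection of each cell output to the output-window dimension m,
    # without a bias component.
    preds = tf.keras.layers.Dense(m, use_bias=False)(states)
    if scheme == "LS2":
        preds = preds[:, -1:, :]     # only the final LSTM cell instance
    model = tf.keras.Model(inputs, preds)
    # For LS1, the squared-error loss is averaged over every time step, which
    # plays the role of accumulating the per-step errors e_t during BPTT.
    model.compile(optimizer="adam", loss="mse")
    return model
```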

V-B Exogenous Variables

We use a combination of static and dynamic features to model external factors that affect the sales demand. In general, static features contain time-invariant information, such as product class, product category, etc. Dynamic features include the available calendar-related information (e.g., holidays, season, weekday/weekend). These features can be useful in capturing the demand behaviour of products in a certain period of time.

Fig. 4 demonstrates an example of applying the MW approach (see Section IV-E) to include static and dynamic features in an input window. Now, the input window is a unified vector of the past sales observations, static features, and dynamic features. As a result, in addition to the input window of past sales observations $W_t$, we use input windows for holidays, seasons, days of the week, and sub-category types. Later, the LSTM uses a concatenation of these input windows to learn the actual observations of the output window $Y_t$.

Fig. 4: Using both static and dynamic features with the MW approach. All categorical variables are represented as "one-hot-encoded" vectors in the LSTM training data.
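A toy sketch of how such a combined input vector could be assembled for one window position; all argument names and dimensions are illustrative rather than taken from the paper.

```python
import numpy as np

def build_feature_window(past_sales, weekday_idx, holiday_flags, subcat_idx,
                         n_weekdays=7, n_subcats=15):
    """Concatenate one sales input window with its one-hot encoded dynamic
    features (day of week and a holiday flag per day) and a static one-hot
    sub-category feature."""
    weekday = np.eye(n_weekdays)[np.asarray(weekday_idx)].ravel()  # per-day one-hot
    holiday = np.asarray(holiday_flags, dtype=float)               # per-day 0/1 flag
    subcat = np.eye(n_subcats)[subcat_idx]                         # static one-hot
    return np.concatenate([np.asarray(past_sales, dtype=float),
                           weekday, holiday, subcat])
```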

VI Overall Procedure

The proposed forecasting framework is composed of three processing phases, namely 1) pre-processing layer, 2) LSTM training layer, and 3) post-processing layer. Fig. 5 gives a schematic overview of our proposed forecasting framework.

As described in Section IV, we initially conduct a series of preprocessing steps to arrange the raw data for the LSTM training procedure. Afterwards, the LSTM models are trained according to the LSTM-LS1 and LSTM-LS2 learning schemes shown in Fig. 3. Then, in order to obtain the final forecasts, we rescale and denormalize the predictions produced by the LSTM. Here, the rescaling process back-transforms the generated forecasts to their original scale of sales, whereas the denormalization process (see Section IV-E) adds back the mean sales of the last input window to the forecasts.
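Under the notation used in Sections IV-D and IV-E, the post-processing step amounts to the following sketch (ours, not the authors' code):

```python
import numpy as np

def postprocess(forecast_norm, window_mean, scale):
    """Reverse the preprocessing for one item: add back the mean of the last
    input window (denormalisation) and multiply by the item's mean-sales
    scaling factor (rescaling to the original sales volume)."""
    return (np.asarray(forecast_norm) + window_mean) * scale
```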

Fig. 5: The overall summary of the proposed sales demand forecasting framework, which consists of a pre-processing, an LSTM training, and a post-processing part.

VII Experiments

In this section, we describe the experimental setup used to empirically evaluate our proposed forecasting framework. This includes the datasets, error metrics, hyper-parameter selection method, benchmark methods and LSTM variants used to perform the experiments, and the results obtained.

VII-A Datasets

We evaluate our forecasting framework on datasets collected from Walmart.com. Initially, we evaluate our framework on a subset of 1724 items that belong to the household product category, which consists of 15 different sub-categories. Next, we scale up the number of products to 18254 by extracting a collection from a single super-department, which consists of 16 different categories.

We use 190 consecutive days of sales data from 2018. The last 10 days of data are reserved for model testing. We define our forecasting horizon as 10, i.e., the training output window size is 10. Following the heuristic proposed by [18], we choose the size of the training input window as 13 (10 × 1.25, rounded up).

Model Parameter Minimum value Maximum value
LSTM-cell-dimension 50 100
Mini-batch-size 60 1500
Learning-rates-per-sample
Maximum-epochs 5 20
Gaussian-noise-injection
L2-regularization-weight
TABLE III: LSTM Parameter grid

VII-B Error Measure

We use the mean absolute percentage error (MAPE) as our forecasting error metric. We define the MAPE as:

$$\text{MAPE} = \frac{1}{m}\sum_{t=1}^{m}\frac{|A_t - F_t|}{A_t} \qquad (3)$$

Here, $A_t$ represents the actual sales demand at time $t$, and $F_t$ is the respective sales forecast generated by a prediction model. The number of sales data points in the test set is denoted by $m$, which is equal to the intended forecasting horizon. Furthermore, to avoid division-by-zero problems, we add a constant term to the denominator of (3).

In addition to the mean of the MAPEs (Mean MAPE), we also produce the median of the MAPEs (Median MAPE), which is suitable to summarise the error distribution in situations where the majority of the observations are zero sales, i.e., long tailed sales demand items.
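A small sketch of the metric and its summaries, assuming per-item arrays of actuals and forecasts; the value of the denominator constant is illustrative, since the paper does not report it.

```python
import numpy as np

def mape(actual, forecast, eps=1.0):
    """MAPE over one item's test horizon (Eq. 3). `eps` is the constant added
    to the denominator to cope with zero sales; 1.0 is only an illustration."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs(actual - forecast) / (actual + eps)))

# Summaries across all items, as reported in Tables IV and V:
# per_item = [mape(a, f) for a, f in zip(actuals, forecasts)]
# mean_mape, median_mape = np.mean(per_item), np.median(per_item)
```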

MAPE (All) MAPE (G1) MAPE (G2) MAPE (G3)
k = 1724 k = 549 k = 544 k = 631
Model Configuration Mean Median Mean Median Mean Median Mean Median
LSTM.ALL LSTM-LS1/Bayesian/Adam 0.888 0.328 1.872 0.692 0.110 0.073 0.640 0.283
LSTM.ALL LSTM-LS1/Bayesian/COCOB 0.803 0.267 1.762 0.791 0.070 0.002 0.537 0.259
LSTM.ALL LSTM-LS2/Bayesian/Adam 0.847 0.327 1.819 0.738 0.103 0.047 0.582 0.326
LSTM.GROUP LSTM-LS1/Bayesian/Adam 0.873 0.302 1.882 0.667 0.093 0.016 0.604 0.283
LSTM.GROUP LSTM-LS1/Bayesian/COCOB 1.039 0.272 2.455 0.818 0.074 0.000 0.549 0.250
LSTM.GROUP LSTM-LS2/Bayesian/Adam 0.812 0.317 1.818 0.738 0.091 0.022 0.587 0.314
LSTM.FEATURE LSTM-LS1/Bayesian/Adam 1.065 0.372 2.274 0.889 0.135 0.100 0.738 0.388
LSTM.FEATURE LSTM-LS1/Bayesian/COCOB 0.800 0.267 1.758 0.772 0.069 0.000 0.533 0.255
LSTM.FEATURE LSTM-LS2/Bayesian/Adam 0.879 0.324 1.886 0.750 0.091 0.022 0.611 0.324
LSTM.CLUSTER LSTM-LS1/Bayesian/Adam 0.954 0.313 2.109 0.869 0.135 0.110 0.625 0.322
LSTM.CLUSTER LSTM-LS1/Bayesian/COCOB 0.793 0.308 1.695 0.748 0.077 0.005 0.562 0.302
LSTM.CLUSTER LSTM-LS2/Bayesian/Adam 1.001 0.336 2.202 0.863 0.084 0.017 0.664 0.347
EWMA _ 0.968 0.342 1.983 1.026 0.107 0.021 0.762 0.412
ARIMA _ 1.153 0.677 2.322 0.898 0.103 0.056 0.730 0.496
ETS (non-seasonal) _ 0.965 0.362 2.020 0.803 0.113 0.060 0.713 0.444
ETS (seasonal) _ 0.983 0.363 2.070 0.804 0.116 0.059 0.713 0.445
Naïve _ 0.867 0.250 1.803 0.795 0.124 0.000 0.632 0.250
Naïve Seasonal _ 0.811 0.347 1.789 0.679 0.086 0.000 0.523 0.320
TABLE IV: Results for category level dataset
MAPE (All items) MAPE (G1) MAPE (G2) MAPE (G3)
k = 18254 k = 5682 k = 5737 k = 6835
Model Configuration Mean Median Mean Median Mean Median Mean Median
LSTM.ALL LSTM-LS1/Bayesian/Adam 1.006 0.483 2.146 1.285 0.191 0.079 0.668 0.434
LSTM.ALL LSTM-LS1/Bayesian/COCOB 0.944 0.442 2.041 1.203 0.163 0.053 0.614 0.394
LSTM.GROUP LSTM-LS1/Bayesian/Adam 0.871 0.445 1.818 1.009 0.189 0.067 0.603 0.377
LSTM.GROUP LSTM-LS1/Bayesian/COCOB 0.921 0.455 1.960 1.199 0.173 0.053 0.618 0.394
LSTM.FEATURE LSTM-LS1/Bayesian/Adam 0.979 0.424 2.117 1.279 0.151 0.050 0.653 0.377
LSTM.FEATURE LSTM-LS1/Bayesian/COCOB 1.000 0.443 2.143 1.282 0.215 0.092 0.676 0.398
EWMA _ 1.146 0.579 2.492 1.650 0.229 0.091 0.805 0.562
ARIMA _ 1.084 0.536 2.305 1.497 0.198 0.094 0.734 0.510
ETS (non-seasonal) _ 1.097 0.527 2.314 1.494 0.204 0.092 0.755 0.509
ETS (seasonal) _ 1.089 0.528 2.290 1.483 0.204 0.092 0.756 0.510
Naïve _ 0.981 0.363 2.008 1.122 0.204 0.000 0.713 0.286
Naïve Seasonal _ 1.122 0.522 2.323 1.513 0.219 0.050 0.803 0.475
TABLE V: Results for super-department level dataset

VII-C Hyperparameter Selection & Optimization

Our LSTM based learning framework contains various hyper-parameters, including LSTM cell dimension, model learning rate, number of epochs, mini-batch size, and model regularization terms, i.e., Gaussian-noise and L2-regularization weights. We use two implementations of a Bayesian global optimization methodology, bayesian-optimization [31] and SMAC [30], to autonomously determine the optimal set of hyper-parameters in our model [32]. Table III summarises the bounds of the hyper-parameter values used throughout the LSTM learning process, represented by the respective minimum and maximum columns.

Moreover, we use the gradient-based Adam [33] and COntinuous COin Betting (COCOB) [34] algorithms as our primary learning optimization algorithms to train the network. Unlike other gradient-based optimization algorithms, COCOB does not require tuning of the learning rate.
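For illustration, hyper-parameter tuning with the bayesian-optimization package could be wired up roughly as follows; the objective function is a stub, and in practice it would wrap the full LSTM training and validation loop (including the choice between Adam and COCOB and the remaining parameters of Table III). SMAC exposes a similar interface.

```python
from bayes_opt import BayesianOptimization  # the bayesian-optimization package [31]

def validation_score(lstm_cell_dim, mini_batch_size, max_epochs):
    """Placeholder objective: train the LSTM with the rounded hyper-parameters
    and return the negative validation error (maximising it minimises error).
    The actual training/validation loop is omitted here."""
    # err = train_and_validate(int(lstm_cell_dim), int(mini_batch_size), int(max_epochs))
    err = 1.0   # dummy value so the sketch runs end to end
    return -err

optimizer = BayesianOptimization(
    f=validation_score,
    pbounds={"lstm_cell_dim": (50, 100),       # bounds from Table III
             "mini_batch_size": (60, 1500),
             "max_epochs": (5, 20)},
    random_state=1,
)
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)    # best hyper-parameter configuration found
```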

VII-D Benchmarks and LSTM Variants

We use a host of different univariate forecasting techniques to benchmark against our proposed forecasting framework. This includes forecasting methods from the exponential smoothing family, i.e., the exponentially weighted moving average (EWMA) and exponential smoothing (ETS) [35], as well as the autoregressive integrated moving average model (ARIMA) [35]. Furthermore, we include standard benchmarks in forecasting, Naïve and Naïve Seasonal. Some of these benchmarks are currently used in the forecasting framework at Walmart.com.
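The ETS and ARIMA benchmarks are available through the R forecast package [35]; the two naive baselines are simple enough to sketch directly (an illustrative sketch, with weekly seasonality as in our experiments).

```python
import numpy as np

def naive_forecast(history, horizon):
    """Naive benchmark: repeat the last observed sales value over the horizon."""
    return np.repeat(history[-1], horizon)

def seasonal_naive_forecast(history, horizon, season_length=7):
    """Naive Seasonal benchmark: repeat the last full seasonal cycle
    (weekly seasonality, as used in our experiments)."""
    last_cycle = np.asarray(history[-season_length:])
    reps = int(np.ceil(horizon / season_length))
    return np.tile(last_cycle, reps)[:horizon]
```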

Furthermore, in our experiments, we add the following variants of our baseline LSTM model.

  • LSTM.ALL: The baseline LSTM model, where one model is globally trained across all the available time series.

  • LSTM.GROUP: A separate LSTM model is built on each subgroup of time series, which are identified by the domain knowledge available.

  • LSTM.FEATURE: The subgroup labels identified in the LSTM.GROUP approach are used as an external feature (one-hot encoded vector) of the LSTM.

  • LSTM.CLUSTER: The time series sub-grouping is performed using a time series feature based clustering approach (see Section IV-C). Similar to LSTM.GROUP, a separate LSTM model is trained on each cluster.

VII-E Results & Discussion

Table IV and Table V show the results for the category level and super-department level datasets. Here, $k$ corresponds to the number of items in each group. We use a weekly seasonality in the seasonal benchmarks, i.e., ETS (seasonal) and Naïve Seasonal. It is also noteworthy that for the super-department dataset, we only employ one grouping strategy, namely LSTM.GROUP, and include only the best-performing learning scheme on the category level dataset, which is LSTM-LS1, to examine the robustness of our forecasting framework.

In the tables, under each LSTM variant, we present the results of the different learning schemes, i.e., LSTM-LS1 and LSTM-LS2, hyper-parameter selection methods, i.e., Bayesian and SMAC, and optimization algorithms, i.e., Adam and COCOB, which achieve comparable results.

According to Table IV, considering all the items in the category, the proposed LSTM.CLUSTER variant obtains the best Mean MAPE, while the Naïve forecast gives the best Median MAPE. Meanwhile, regarding G1, which contains the items with the most business impact, the LSTM.CLUSTER and LSTM.GROUP variants outperform the rest of the benchmarks, in terms of the Mean MAPE and the Median MAPE, respectively. We also observe in G1 that the results of the LSTM.ALL variant are improved after applying our grouping strategies. Furthermore, on average, the LSTM variants together with the Naïve forecast achieve the best-performing results within G2 and G3, where the product sales are relatively sparse compared to G1.

We observe a similar pattern of results in Table V, where, overall, the LSTM.GROUP variant gives the best Mean MAPE, while the Naïve forecast ranks first in Median MAPE. Likewise, in G1 the LSTM.GROUP variant performs best amongst the benchmarks, and in particular outperforms the LSTM.ALL variant, underscoring the benefits of item grouping strategies under these circumstances. Similarly, on average, the LSTM variants and the Naïve forecast obtain the best results in G2 and G3.

Overall, the majority of the LSTM variants display competitive results under both evaluation settings, showing the robustness of our forecasting framework with large amounts of items. More importantly, these results reflect the contribution made by the time series grouping strategies to uplift the baseline LSTM performance.

VIII Conclusions

There exists great potential to improve sales forecasting accuracy in the E-commerce domain. One good opportunity is to utilize the correlated and similar sales patterns available in a product portfolio. In this paper, we have introduced a novel demand forecasting framework based on LSTMs that exploits non-linear relationships that exist in E-commerce business data.

We have used the proposed approach to forecast the sales demand by training a global model across the items available in a product assortment hierarchy. Our developments also present several systematic grouping strategies to our base model, which are in particular useful in situations where product sales are sparse.

Our methodology has been evaluated on a real-world E-commerce database from Walmart.com. To demonstrate the robustness of our framework, we have assessed our propositions on category level and super-department level datasets. The results have shown that our methods have outperformed the state-of-the-art univariate forecasting techniques.

Furthermore, the results indicate that E-commerce product hierarchies contain various cross-product demand patterns and correlations, and that approaches to exploit this information are necessary to improve sales forecasting accuracy in this space.

References

  • [1] Hyndman, R. et al., 2008. Forecasting with Exponential Smoothing: The State Space Approach, Springer Science & Business Media.
  • [2] Box, G.E.P. et al., 2015. Time Series Analysis: Forecasting and Control, John Wiley & Sons.
  • [3] Box, G.E.P. & Cox, D.R., 1964. An Analysis of Transformations. Journal of the Royal Statistical Society. Series B, Statistical methodology, 26(2), pp.211–252.
  • [4] Yeo, J. et al., 2016. Browsing2purchase: Online Customer Model for Sales Forecasting in an E-Commerce Site. In Proceedings of the 25th International Conference Companion on World Wide Web. International World Wide Web Conferences Steering Committee, pp. 133–134.
  • [5] Ramanathan, U., 2013. Supply chain collaboration for improved forecast accuracy of promotional sales. International Journal of Operations & Production Management.
  • [6] Kulkarni, G., Kannan, P.K. & Moe, W., 2012. Using online search data to forecast new product sales. Decision support systems, 52(3), pp.604–611.
  • [7] Zhao, K. & Wang, C., 2017. Sales Forecast in E-commerce using Convolutional Neural Network. arXiv [cs.LG]. Available at: http://arxiv.org/abs/1708.07946.
  • [8] Seeger, M.W., Salinas, D. & Flunkert, V., 2016. Bayesian Intermittent Demand Forecasting for Large Inventories. In D. D. Lee et al., eds. Advances in Neural Information Processing Systems 29. Curran Associates, Inc., pp. 4646–4654.
  • [9] Snyder, R., Ord, J.K. & Beaumont, A., 2012. Forecasting the intermittent demand for slow-moving inventories: A modelling approach. International journal of forecasting, 28(2), pp.485–496.
  • [10] Zhang, G., Patuwo, B.E. & Hu, M.Y., 1998. Forecasting with artificial neural networks: The state of the art. International journal of forecasting, 14(1), pp.35–62.
  • [11] Yan, W., 2012. Toward automatic time-series forecasting using neural networks. IEEE transactions on neural networks and learning systems, 23(7), pp.1028–1039.
  • [12] Zimmermann, H.-G., Tietz, C. & Grothmann, R., 2012. Forecasting with Recurrent Neural Networks: 12 Tricks. In Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, pp. 687–707.
  • [13] Trapero, J.R., Kourentzes, N. & Fildes, R., 2015. On the identification of sales forecasting models in the presence of promotions. The Journal of the Operational Research Society, 66(2), pp.299–307.
  • [14] Borovykh, A., Bohte, S. & Oosterlee, C.W., 2017. Conditional Time Series Forecasting with Convolutional Neural Networks. arXiv [stat.ML]. Available at: http://arxiv.org/abs/1703.04691.
  • [15] Flunkert, V., Salinas, D. & Gasthaus, J., 2017. DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks. arXiv [cs.AI]. Available at: http://arxiv.org/abs/1704.04110.
  • [16] Wen, R. et al., 2017. A Multi-Horizon Quantile Recurrent Forecaster. arXiv [stat.ML]. Available at: http://arxiv.org/abs/1711.11053.
  • [17] Chapados, N., 2014. Effective Bayesian Modeling of Groups of Related Count Time Series. In E. P. Xing & T. Jebara, eds. Proceedings of the 31st International Conference on Machine Learning. Proceedings of Machine Learning Research. Bejing, China: PMLR, pp. 1395–1403.
  • [18] Bandara, K., Bergmeir, C.& Smyl, S., 2017. Forecasting Across Time Series Databases Using Recurrent Neural Networks on Groups of Similar Series: A Clustering Approach. arXiv [cs.LG]. Available at: http://arxiv.org/abs/1710.03222.
  • [19] Elman, J.L., 1990. Finding Structure in Time. Cognitive science, 14(2), pp.179–211.
  • [20] Williams, R.J. & Zipser, D., 1995. Gradient-based learning algorithms for recurrent networks and their computational complexity. Backpropagation: Theory, architectures, and applications, 1, pp.433–486.
  • [21] Bengio, Y., Simard, P. & Frasconi, P., 1994. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council, 5(2), pp.157–166.
  • [22] Hochreiter, S. & Schmidhuber, J., 1997. Long Short-Term Memory. Neural computation, 9(8), pp.1735–1780.
  • [23] Sutskever, I., Vinyals, O. & Le, Q.V., 2014. Sequence to Sequence Learning with Neural Networks. In Z. Ghahramani et al., eds. Advances in Neural Information Processing Systems 27. Curran Associates, Inc., pp. 3104–3112.
  • [24] Gregor, K. et al., 2015. DRAW: A Recurrent Neural Network For Image Generation. arXiv [cs.CV]. Available at: http://arxiv.org/abs/1502.04623.
  • [25] Graves, A., r. Mohamed, A. & Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. ieeexplore.ieee.org, pp. 6645–6649.
  • [26] Lipton, Z.C. et al., 2015. Learning to Diagnose with LSTM Recurrent Neural Networks. Available at: http://arxiv.org/abs/1511.03677.
  • [27] Ben Taieb, S. et al., 2012. A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition. Expert systems with applications, 39(8), pp.7067–7083.
  • [28] Ord, K., Fildes, R.A. & Kourentzes, N., 2017. Principles of Business Forecasting. 2nd ed., Wessex Press Publishing Co.
  • [29] Abadi, M. et al., 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv [cs.DC]. Available at: http://arxiv.org/abs/1603.04467.
  • [30] Hutter, F., Hoos, H.H. & Leyton-Brown, K., Sequential Model-Based Optimization for General Algorithm Configuration (extended version).
  • [31] Fernando, 2017. bayesian-optimization: Bayesian Optimization of Hyper-parameters, Github. Available at: https://bit.ly/2EssG1r [Accessed November 3, 2018]
  • [32] Snoek, J., Larochelle, H. & Adams, R.P., 2012. Practical Bayesian Optimization of Machine Learning Algorithms. In F. Pereira et al., eds. Advances in Neural Information Processing Systems 25. Curran Associates, Inc., pp. 2951–2959
  • [33] Kingma, D.P. & Ba, J., 2014. Adam: A Method for Stochastic Optimization. arXiv [cs.LG]. Available at: http://arxiv.org/abs/1412.6980.
  • [34] Orabona, F. & Tommasi, T., 2017. Training Deep Networks without Learning Rates Through Coin Betting. arXiv [cs.LG]. Available at: http://arxiv.org/abs/1705.07795.
  • [35] Hyndman, R.J. & Khandakar, Y., 2008. Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 26(3), pp.1–22. Available at: http://www.jstatsoft.org/article/view/v027i03.
  • [36] Hyndman, R.J. et al., 2018. Time series features R package. Available at https://bit.ly/2GekHql.