ARMDN: Associative and Recurrent Mixture Density Networks for eRetail Demand Forecasting
Abstract
Accurate demand forecasts can help online retail organizations better plan their supplychain processes. The challenge, however, is the large number of associative factors that result in large, nonstationary shifts in demand, which traditional time series and regression approaches fail to model. In this paper, we propose a Neural Network architecture called ARMDN, that simultaneously models associative factors, timeseries trends and the variance in the demand. We first identify several causal features and use a combination of feature embeddings, MLP and LSTM to represent them. We then model the output density as a learned mixture of Gaussian distributions. The ARMDN can be trained endtoend without the need for additional supervision. We experiment on a dataset of an year’s worth of data over tensofthousands of products from Flipkart. The proposed architecture yields a significant improvement in forecasting accuracy when compared with existing alternatives.
ARMDN: Associative and Recurrent Mixture Density Networks for eRetail Demand Forecasting \vldbAuthorsSrayanta Mukherjee, Devashish Shankar, Atin Ghosh, Nilam Tathawadekar, Pramod Kompalli, Sunita Sarawagi, Krishnendu Chaudhury \vldbDOIhttps://doi.org/TBD
1 Introduction
In the retail industry, accurate sales forecasts are essential for timely buying and replenishment of inventory, topline demand planning and effective supplychain utilization [15, 19]. Errors in predicting demand adversely affects multiple business critical metrics such as outofstock percentage, inventory health, wastage of man power and logistical resources. In this paper we present how we tackle the challenges of demand forecasting for inventory management in Flipkart, India’s largest eRetail company.
In an eRetail setting, demand prediction is particularly challenging, given the huge catalog of products numbering in tens of millions, sale events (such as festival sales), frequent discounting of price, geographically distributed customer base, and competitive interactions On the positive side, due to the pervasive nature of eRetail, its huge customer bases, and “online” mode of transactions, there is a wealth of data that could be exploited to build accurate models for demand prediction. For example, Flipkart has an active catalogue of 80 million products, a daily footfall of 10 million page visits, and sells an average 0.5 million units per day.
Much of previous demand prediction methods relied on statistical timeseries methods such as ARIMA and other state space models that capture parameters like seasonality and trends. However, traditional timeseries models are incapable of taking into account a number of associative factors that have strong influence on short term buying decisions, including factors like deepdiscounting, bundle offers, aggressive merchandising campaigns, and competitor trends [10]. These factors lead to wild fluctuations on a weekly basis with sharp spikes. The nontrivial nature of the problem dictates the use of advanced machine learning approaches which are capable of handling the multidimensional feature space that drives demand. Our first method of choice was a Boosted Cubist [27] model that is a powerful regression model that combines four learning ideas: a regression tree to capture nonlinear feature interaction, a linear regression model at the leaves, a nearest neighbor refinement, and finally a Boosted ensemble. This, and related models like Gradient Boosted Decision Trees have been found to be highest performing in many prediction challenges[1].
A failing of Cubist [27], and similar regressors, is that it is a scalar prediction model and requires special preprocessing to capture sequential predictions on a timeseries. Also, motivated by the impressive success of modern deep learning methods on various speech, text, and image processing tasks, we next focused on a neural network model for our demand forecasting problem. However, we soon found that a straightforward deployment of recurrent neural networks for time series forecasting failed to provide us any gains beyond existing regression models (like Boosted Cubist) on Flipkart’s challenging dataset. In this paper we present the many design and engineering challenges that we had to tackle in implementing a modern deeplearning model, that substantially outperformed our first generation Boosted Cubist model.
Our multilayered hybrid deep neural network, called Associative and Recurrent Mixture Density Networks (ARMDN), can effectively combine both timeseries and associative factors. ARMDN outputs the probability distribution over demands as a mixture of Gaussians, and we show that this is crucial to capture the unpredictable trends in demands in heterogeneous settings such as ours. We provide engineering solutions for training models along hierarchies of products and geography, and demonstrate the role of our carefully designed feature sets.
1.1 Problem Setting
Our point of sales (POS) data contains with each sales transaction a number of properties that potentially influenced that purchase event. These span various properties of the item sold, the geography where sold, and events/properties characterizing the time of sale. When creating a forecast model the first key decision is choosing a level of granularity of representing the dimensions of item, geography, and time.
In this work, we focus on demand prediction for inventory management and replenishment, where the demand needs to be estimated at the SKU level. The inventory is typically replenished on a weekly basis, hence, we aim to generate a weekly forecast of demand. Further, the supply chain can perform efficiently if the customer demands are met from the nearest warehouse. Therefore, along the geography dimension we assess demand at the warehouse or âfulfillment centerâ (FC) level. Succinctly, we aim to make predictions at the granularity of SKU FC Week level along the item, geography, and time dimensions respectively.
1.2 Contributions
We highlight the main contributions of this work here.
We present the design of ARMDN, a deeplearning based forecast model that can grapple with the scale and heterogeneity of the largest eRetail store in India. Our proposed model incorporates a number of careful design ideas including a mixture of Gaussians at the output layer, a practical training methodology to tradeoff sparsity with diversity, and a staged feedforward, recurrent layer to fuse associative with temporal features.
We compare this network with the best of the breed traditional method based on boosted hybrid trees. We are aware of no earlier such study at the scale of a challenging retail dataset such as ours.
Our experiments on Flipkart’s sales data show that a wellengineered deep learning models does indeed live up to their hype of being significantly better than a state of the art nondeep learning based method. ARMDN causes between 10 to 25% reduction in error compared to Boosted Cubist.
We show that our specific neural architecture is significantly better than offtheshelf deep learning time series models. Somewhat surprisingly we observe that feature engineering continues to be useful even for deep learning models and we get a 720% improvement in performance by our engineered features.
1.3 Outline
The rest of the paper is organised as follows. In Section 2 we present an overview of the extensive literature on forecasting models. In Section 3 we present ARMDN, our proposed deep learning based model. In Section 4 we compare the performance of our proposed model with a stateoftheart traditional model for time series forecasting, and alternative deep learning models. We conclude and provide future directions in Section 5.
2 Literature Review
We categorize the massive literature of time series forecasting into four groups and present an overview of each.
Traditional time series methods Traditional time series methods like BoxJenkins [4, 5] and variations of the state space models like ARIMA and HoltWinters[14], have been used for demand forecasting in the industry for a very long time. The reason for their popularity is their simplicity. While such methods produce admirable results in many cases, for retail demand forecasting, these methods perform poorly. A primary reason for their failure is the presence of several exogeneous casual factors, that make the demand curve highly erratic with sharp peaks and valleys. Further, the sales time series is often nonstationary and heteroscedastic, thus violating core assumptions of the classical methods. Additionally, in large eRetail stores like Flipkart, the product landscape is huge, making the demand curve a function not just of itself but substitutable and complementary products as well. To alleviate these problems, a number of innovations have been attempted in recent years that we review next.
Waveletbased approaches A popular approach to tackle nonstationarity in timeseries data, is Wavelet Transforms (WT) [40, 38, 25]. The advantage of WT is that it can effectively decompose the time series into its component frequency and time, thereby allowing for identification of primary frequency components. For example, for financial time series prediction, multiple approaches [3, 24] used the WT, modeling the market as fractional Brownian motion. The WT can also be used as a preprocessing to multiple methods such as Artificial Neural Network (ANN) [42], Kalman filtering [9] and an AR model [33]. A variant was proposed in Stolojescu et al. [34] of using Stationary WT (SWT) as a preprocessing step to ARIMA, ANN, linear regression and random walk. Overall, the ANN combined with the SWT performed the best among the four models, effectively capturing the nonlinearity. They also compared various mother wavelets and concluded that the Haar and Reverse Biorthogonal were most suitable for the task. Zheng et al. [42], Soltani et al. [33] and Renauld et al. [31] have also explored the use of Haar mother wavelet in timeseries forecasting problems. However, wavelet based methods were unsuccessful in producing acceptable results in our case primarily due to their inability to account for the causal ( associative) features that lead to sharp changes in demand.
Regressionbased methods Another approach is to use generic machine learning regressors along with autoregressive features to capture the timeseries [2]. A comparison of the different classes of Regression methods can be found in the work of Wang et al. [37] and Torgo [35]. Regression methods rely on minimizing the sum of squared errors or other similar objective functions while achieving a balance between the bias and variance of the model. Particularly robust and highperforming models have traditionally been i) Support Vector Regression (SVRs) that uses nonlinear kernel functions ii) Random forests, that averages over independent individual decision trees and iii) Gradient boosting which iteratively reduces residual errors using new models.
Cubist [27] is a hybrid of decision trees (or rules) and linear regression models where the predictions are further refined with a nearest neighbor adjustment. The Cubist regression method has also been found to perform admirably, as compared to other popular regression techniques in forecasting problems as highlighted in the works of Zhang et al. [41], Meyer et al. [22] and Walton [36]. However, Cubist and other regression based methods are not designed to explicitly model timeseries data. They instead extract intermediate features based on the timeseries, which are then treated as independent variables while learning the model.
Deep Learning Methods Deep Neural Networks have recently been proven to have tremendous potential in modeling tough classification and regression problems. One of the first work on timeseries based forecasting using deep neural networks was by Busseti et al. [6] for energy load forecasting. They implemented and compared multiple methods including a deep feedforward network and a deep recurrent network (RNN), amongst which the RNN was found to perform the best. Similarly Shi et al. [32] extended a Fully Connected LSTM (FCLSTM) to have convolution like features (ConvLSTM) for the problem of precipitation forecasting over a localized region for a short time period in the future. Their experiments were able to capture spatiotemporal features better and outperformed more traditional methods. Another deep learning methodology that has been used for timeseries forecasting problems are deep belief networks (DBN). Qiu et al. [26] applied an ensemble of DBNs aggregated together by a Support Vector Regression (SVR) model to the problem of energy load forecasting. Takashi Kuremoto used multiple Restricted Boltzmann machines (RBM) optimized via Particle Swarm Optimization (PSO) [18]. Another innovative approach was proposed by Mocanu et al. [23] who used a Factored Conditional RBM (FCRBM) for energy load forecasting. Their method also used additional associative features like electricity price and weather conditions other than the time series only.
In retail forecasting, which requires a probabilistic forecasting output, results have mostly been mixed using modern deeplearning techniques [16]. Recently though, Flunkert et al. [11] used a probabilistic LSTM, DeepAR which was both autoregressive and recurrent and was able to demonstrate significant improvements. Key to their success was to learn a global model across the timeseries of all items instead of relying on the timeseries of a single product alone. This is similar to the strategy that we follow in this work. Second, they propose to model the output on a Gaussian likelihood for realvalued data and negative binomial likelihood for positive count data. In contrast, our approach is to use a mixture of Gaussians as the output density model. Importantly, this article holds particular significance for the field as they were able to effectively demonstrate the applicability of modern deep learning techniques on the problem of demand forecasting for eRetail.
3 Our Prediction Model
In this section we present a deep learning based solution to the challenging task of demand forecasting in large online retail stores. We start by formally defining the demand forecasting task. Then in Section 3.2 we describe the architecture of ARMDN, our proposed forecasting model. In Section 3.3 we describe the training process.
3.1 Problem Formulation
For each SKU “” and geographical region “”, we have a time series of weekly demand values up to a point of time “” measured in weeks
Features
Our features capture a large variety of properties of the product, the time, the product’s price at the time, advertisements surrounding the sales event, and so on. Table 1 presents the features we extracted from our point of sales data grouped into five different clusters. Further, a feature could be numerical, categorical, or binary. Some of these are raw features of the data, e.g. the price of the product but we also have a large number of interesting features derived based on our intuitive understanding of consumer behavior. For example, consider the derived feature called “time since last price change”. We observed that when the price of an SKU is decreased from to , the demand increases from to () for only a brief time period. Beyond this time period, even if the price is maintained at , the demand rate decays quickly from and stabilizes back to . In Section 4.3, we will show the tremendous impact of these derived features.
We use “” to index a time series corresponding to a combination and denote its feature and demand values along time as where each denotes the vector of all associative features captured at time for time series and denotes the corresponding demand. Thus, our available dataset can be expressed as a set of time series of the form
in our setup is very large and is of the order of several million.
For forecasting, we are given a set of consecutive future time horizons and the corresponding input features for them. We need to predict the corresponding demand values . In order to capture the uncertainty associated with these predictions, we propose to output a distribution instead of a fixed value. We use a parametric model that trains parameters to predict such a distribution as
(1) 
3.2 The ARMDN Architecture
We propose a new architecture for demand forecasting. The proposed model combines a multilayer perceptron (MLP) and a longshort term memory (LSTM) network for simultaneous handling of associative and timeseries features, and a mixture density network at the output to flexibly capture output uncertainty. The final model, ARMDN, is therefore associative, recurrent and probabilistic. Figure 1 shows a representation of the overall deep net architecture. In what follows, we present detailed descriptions of the three main stages of our model.
Demand Factor  Feature Name  Description  Data Type 
Product  Product ID  Captures “latent” factors of the product  1Hot 
Product Tier  Tiers are based on popularity of products  Categorical  
Historical OutofStock %  Assumed to be always InStock during Testing phase  Numerical  
Product Visibility  Sale event type  Categorylevel of the Sale event  Categorical 
Deal Card  Binary  
Banner on Homepage  Binary  
Price  Listed Price  Govt. mandated Max. Retail Price  Numerical 
Discounted Price  Price offered by Flipkart  Numerical  
Effective Price  Final price after cashback, product exchange, freebie, etc.  Numerical  
Difference in price from historical mean  Captures “stable” price of the product  Numerical  
Historical Min Price  Numerical  
Historical Max Price  Numerical  
Avg. discount on similar products  Products in the same category are considered similar  Numerical  
Convenience  Nocost EMI  Binary  
Product Exchange  Binary  
Exclusive to Flipkart  Binary  
Time  Week of the month  Captures seasonality  Numerical 
Lag feature  Price  Mean of past weeks’ price  Numerical  
Lag feature  Sale  Mean of past weeks’ sale units  Numerical  
Time elapsed since last price change  Sales peak when price changes, but reverts quickly to previous volume  Numerical  
Time since last event or upcoming event  Units sold dips immediately before and after a Sale event  Numerical  
Time since first order  Captures “Newness” of Product  Numerical 
The Associative Layer
The primary motivation behind this layer was to treat the associative variables that drive demand. Thus, the associative layer is functionally analogous to the regressionbased models discussed in Section 2. The model is executed independently for each time point, the output of which,
where are the parameters of the MLP.
A representative architecture of the feedforward neural network that models the Associative layer is presented in Figure 2. The features are first bucketed into five types as illustrated in Figure 2 and Table 1. The features under the “Price” and “Time” buckets are all continuous and are simply normalized to zero mean and unit variance. The rest of the feature buckets are composed of categorical and 1Hot features which are embedded into a suitable feature space of size . Following this, the entire set of features are concatenated and embedded into a lowerdimensional space of . The final embedding was executed using a single FClayer with an Exponential Linear Unit (ELU) activation, which was shown to have better numerical stability than the popular ReLU [7] .
The Recurrent Layer
As mentioned earlier, the weakness of Regression based models is that they do not explicitly model the timeseries data. The model needs to capture both the shortterm and longterm signals that are present in a timeseries data. To achieve this, the output of the Associative layer is fed into a recurrent neural network (RNN). A popular type of RNN is LSTM that associates each cell with an input gate, an output gate and a forget gate. The memory unit of an LSTM cell stores the hidden representation of the sequence of input data seen till a given time instant. The LSTM cell learns to either update or reset the memory unit depending on the given sequence of values. The LSTM model integrates the output of the associative model along with all previous demand and inputs along the timeseries as:
(2) 
where are the parameters at this stage. Internally, the variable length history is stored in LSTMs as state at each point . At each step, the update in terms of the previous and current input ff then becomes:
(3) 
In the above is the vector that is fed to the next output layer and is the state that is stored in this layer and used for computing the outputs at the next time step. We can view as the autoregressive features that has been automatically computed by the deep network. Hence, by combining the ability of the MLP to factor in associative features with the explicit treatment of the sequential nature of a timeseries via an LSTM, ARMDN is able to harness the signals embedded within the data better than either of the layers alone.
The MDN Layer
The output of a typical Neural Network minimizes a meansquared or a softmax loss function. However, neither of these losses are suitable towards modeling the variation in the demand curve, as can be seen in the example of Figure 4(a). Minimizing the sum of squared error yields an output that follows a single Gaussian distribution, which would not be powerful enough to model the variation in the output space. Softmax, on the other hand, is a generalization of the sigmoid function that models a Multinoulli output distribution. Softmax is more suited to classification tasks rather than the regression tasks that we are interested in.
Instead, the central hypothesis used for modeling the output distribution in our case, is that the regression being performed is multimodal. As see in Figure 4(b), the density plot for the timeseries data in Figure 4(a) is clearly multimodal, which results in multiple values of for similar values of .
Since Gaussian mixtures is the natural representation for this kind of problems, we use a mixture density network (MDN) as an output layer [21, 39]. The MDN hypothesis is intuitively robust considering the many external factors that are not accounted for in the model. Therefore, considering Gaussian mixtures, the conditional distribution can be equated as
(4) 
where , and are the probability, mean and standard deviation of the Gaussian component respectively. These parameters of the GMM are computed from output from the recurrent layer as follows: (We use instead of to reduce clutter.)
(5) 
(6) 
(7) 
where , and are the learned parameters of the output layer corresponding to the mixture probability, variance and mean respectively. The use of the softmax function in Equation 5 ensures that the values of lie between 0 and 1 and sum to 1 as is required for probabilities.
Our model distribution can be factorized as a product of likelihood factors over time. The conditional distribution of the network is therefore trying to model is a function of the ARMDN output , given in Equation 3, and in likelihood terms can be expressed as
(8) 
Thus, from the last layer, we get a welldefined joint probability distribution over the demand values for all times periods in the forecast horizon.
Test set wMAPE: ARMDN  

Test Window  Week 1  Week 2  Week 3  Week4  Overall 
Weeks 912  22.90  27.6  31.06*  36.06  29.58 
Weeks 1316  31.98  35.71*  35.97  37.02  35.17 
Weeks 1518  30.86  31.90  32.74  37.01*  33.13 
Average  28.58  31.74  33.26  36.93  32.63 
Test set wMAPE: Boosted Cubist  
Weeks 912  28.51  33.21  38.63*  43.26  35.63 
Weeks 1316  35.33  39.87*  44.12  46.17  41.70 
Weeks 1518  32.76  36.42  41.17  43.73*  39.98 
Average  32.30  35.50  41.31  44.39  39.06 
3.3 Training and Implementation
Given the scale and diversity of products sold in a large eRetail store like Flipkart, a single set of model parameters over all products in the store is not appropriate, given the large fluctuations in the sale patterns of products within the store. The other extreme of training a separate model for each SKU raises challenges of data sparsity, particularly for new and emerging products. We next discuss our strategy of balancing the conflicting challenges of product heterogeneity and data sparsity in training a learned model.
Trading data heterogeneity and sparsity during training We first trained a joint model, combining all SKUs across all verticals. A “vertical” is a set of products that belong to the same category. Example verticals could be Laptops, Headphones, etc. We then finetuned this model per vertical, training with only the SKUs within the vertical. To augment the relative sparsity of data at the SKU level, all SKU’s belonging to a common vertical (which are relatively homogeneous) are trained together.
Along the geography dimension, we handle hierarchies a bit differently. We train first total demand at a national level which are in a subsequent step distributed to the FC level according to historical ratios. This is necessitated by the fact that data at the FC level is quite sparse and the causal factors like offers and additional visibility is applied only at the national level and not implemented at regional or FC levels. Historical sales data is available at regional granularity. We remove the outliers from the data and based on 8 weeks of history, compute the percentage sales contribution of each region with respect to overall national demand. Using these ratios, we divide the national level forecast into regional level sales demand.
Loss function Since the last layer outputs a probability distribution over the demand values, we can resort to the maximum likelihood principle for training the model parameters. Given the parameter definition in the output layer, the loss function of the model is then given as
(9) 
We evaluate this loss for each product’s national level sales at each point in the training week. Training is done using standard stochastic gradient steps.
MDN loss is known to be highly unstable, essentially because of the division by sigma in Equation 9 [12]. Since is a predicted value of the Neural Network, it can become arbitrarily small. This can lead to exploding activations, and consequently exploding gradients. To prevent this we employed activation clipping. The value of sigma was clipped to the range . This allowed us to successfully train the model. The number of mixtures were empirically chosen to be 10 by hyperparameter tuning.
4 Experimental evaluation
In this section we show a detailed evaluation of different forecasting models on Flipkart’s operational data. We compare our best effort traditional model (Boosted Cubist) with a modern deep learning based model (ARMDN). We are aware of no earlier comparisons of these two families of models on largescale industrial setup such as ours. Next, we justify various design choices in the ARMDN model by showing comparisons with existing simpler variants. Our experiments are particularly significant because the rich diversity of our feature space, stresses the limits of stateoftheart approaches in a way that smaller, public domain datasets cannot.
Evaluation metric Since the downstream business usecase of the forecasting pipeline is buying and replenishment of the inventory, forecasting errors for SKUs which sell a larger number of units have larger negative impact than equivalent error on SKUs which sell lesser units. Hence, the error metric chosen was weighted mean absolute percentage error (wMAPE) at time and is given by the equation:
(10) 
where is the total number of SKUs, is the actual number of units sold at time and is the model forecast at time .
However, in an industrial setup, we are interested in not just aggregated errors but also need to dissect the aggregates along various facets of product and time.
Training and Test Data The data used for our experiments is collected from October 2015 to March 2017. We evaluate the model on three distinct test periods of a duration of 4 weeks each: weeks 912, 1316, and 1518 of 2017. For each test data set, the remaining data (from October 2015 up to the test period) is used as training data. For each experiment, for example week 1516, demands for all weeks before week 14 of 2017 are fed to the ARMDN, and predictions obtained for week 1, 2, 3, and 4 after that are compared with actual demand at weeks 15, 16, 17, 18 respectively.
Implementation The ARMDN was implemented using TensorFlow, and was trained on a single Titan X GPU with 3072 cores and 12 GB RAM. We used Adam Optimizer and an exponential learning decay policy, decaying learning rate by 0.96 every 1000 iterations. Training the combined model took around 12 days, and per vertical finetuning took an additional 45 days. The total training process, therefore, took around a week. This time includes evaluation that was carried out after each epoch. Running the trained model takes 100ms/SKU on a 4core Intel Xeon CPU machine. We also used dropout of 0.5 after the MLP, and before the LSTM input cells, to prevent overfitting. We used mini batch gradient descent, sampling 512 SKUs for each minibatch randomly. The sequence length was chosen to be the maximum sequence length in the mini batch. Zero padding had to be done for rows, which had less data than sequence length of the batch. We later masked the loss for these zero padded data points. LSTM was dynamically unrolled for this sequence length.
Methods compared In Section 2 we discussed a number of methods for demand forecasting. We present comparisons with existing methods in two stages: we first select the best of the breed nondeep learning based method and compare with that. Next, we compare with various architectures among deep learning models.
4.1 Traditional Machine Learning Alternatives
We eliminated pure timeseries based models like ARIMA early on because of their inability to capture the effect of the large number of exogeneous factors causing large swings in outputs. Among the regression methods, the best performing method was a committee of Cubist [27] models. Cubist[17] is an extension of Quinlan’s M5 model [27, 29, 28] model that skillfully combines three learning structures: decision trees at the toplevel to capture nonlinearity, regression models at the leaves to encourage smooth predictions, and finally nearest neighbor based refinement. We trained a committee of Cubist models via Boosting. The values of and were fixed at 9 and 50 via timeslice validation. The Cubist models used the same set of features as ARMDN as outlined in Table 1. However, since Cubist is a scalar regression model, and not a time series model, we created a set of autoregressive features and temporal features along time. Specifically we used autoregressive features like mean sale of the previous week and mean sale of the previous month; and temporal features like week of year and time since first sale/launch.
Table 2 shows the comparison of the results obtained by the ARMDN model and the BoostedCubist model across four weeks for three different test windows. Overall, ARMDN has a wMAPE between across the three sets of experiments whereas Boosted Cubist has a higher error of . The ARMDN model consistently outperforms the Cubist model both of which greatly outperform the ARIMA model (data not shown).
Another interesting trend is the difference in error between the Cubist model and ARMDN model increases as the forecasting horizon increases. For the 1weekout forecast the difference in wMAPE is in the range of which increases to for the 4weeksout forecasts.
The performance is also slightly worse during weeks where a multiday large sale event occurred (weeks 11, 14, and 18) as opposed to the “business as usual” weeks i.e. weeks which did not feature large sale events. We will discuss further on such event weeks in Section 4.4.
Actionable Performance
Another important parameter for success called “hits” can be defined as the number of SKUs for which the forecasts are actionable i.e. has average percentage error less than across all three testing windows. The cutoff of was chosen based on business inputs. For ARMDN /of the SKUs are “hits” for 1weekout and 4weeksout forecasts respectively while the corresponding numbers for the Cubist model are /. Hence, a similar trend holds that the ARMDN outperforms the Cubist and difference is more notable when the prediction time horizon is larger.
Finally, we measure the quality of the models at the SKU level. In Figure 6 we show the number of SKUs which have lesser than cutoff error across weeks. Based on Figure 6, it can be concluded that almost of the SKUs are actionable and the replenishment of these can be completely automated based on the described models.
VerticalLevel Results
While the models are quite robust at the SKU level, it was imperative to understand whether the models performed better across certain verticals (products belonging to the same category), and how consistent the performance was across the product classes. In Figure 5 we show the performance of both models at a vertical level, where it becomes clear that performance is fairly consistent for both models. However, out of the 40 verticals, ARMDN does better than the Cubist model in 27 out of 40 verticals. Since, verticals are a homogeneous mix of SKUs, and are a “decision making unit” from business process perspective, this is a significant data point.
The decision on which of the two models would be adopted as a business process depends on the overall performance of the models on a vertical. Given that the number of SKUs are numerous it is not practically possible to make the choice of model at an SKU level. Therefore, given the performance of ARMDN, it would be more widely integrated into the business process as opposed to Cubist.
Performance on Representative SKUs
In Figure 7 we show the performance of ARMDN and Cubist on a representative set of 6 SKUs for test window 3. In the first 5 SKUs (Figure 7 (a)(e)), the timepoint or Week 18 contains a large sale event. In general, all SKUs in our dataset show sharp increase in demand during an event week, a trend observed here as well. However, these 6 SKUs represent slightly different classes in terms of sale trend and we discuss them individually below. For confidentiality reasons, we are unable to reveal the actual product corresponding to each SKU.
 (a)

This SKU is representative of products that have very low demand during businessasusual (BAU) weeks. However, they show ordersofmagnitude jump in demand during a sale event, in response to the associative factors that drive a large event. ARMDN clearly outperforms Cubist in picking up this drastic change of demand. While Cubist does pick up the increasing trend, it falters in correctly picking the magnitude of increase.
 (b)

This SKU represents those that generally show moderate but steady sales during a BAU week and then the demand jumps sharply during the event week. Here, while both ARMDN and Cubist do well in picking the trend during the nonevent time points, ARMDN clearly outperforms Cubist in estimating the magnitude of the demand jump during the event week.
 (c)

Represents SKUs that have fluctuating demand but shows sharp jumps just prior to an event (around the time frame when marketing and advertising campaigns begin). The increasing trend is maintained during the event week. These SKUs typically see “warmup” offers before the actual event, to cater to the interests of additional visitors to the website in response to marketing campaigns. Both Cubist and ARMDN are successful in modeling the preevent week rise in demand. However, while ARMDN correctly detects the continuing increasing tread correctly, Cubist fails to do so.
 (d)

This example is representative of SKUs which are relatively new products. They show and overall increasing trend, which is then boosted by a sale event. Both Cubist and ARMDN are moderately successful, with ARMDN doing slightly better, in predicting the nonevent and event week demand.
 (e)

This SKU is representative of products that display wildly fluctuating demands on a weekly basis with sharp peaks and deep valleys due to various known and latent associative factors. Event weeks show sharp rises but the magnitude of the rise may have been observed during nonevent weeks as well. Cubist fails in detecting the inherent fluctuating trends and forecasts a steady decrease while the ARMDN does a much better job in modeling the sharp rises and falls.
 (f)

This example represents SKUs which generally have low sales during nonevent weeks, participates but not strongly during an event week but rather tend to have their own ”mini” events. Once again, Cubist fails to pick up the idiosyncratic nature of demand for this SKU while ARMDN forecasts fairly accurately.
While these 6 SKUs are only a snapshot of the variety of demand patterns in our dataset, they clearly highlight the advantage of ARMDN across a large cluster of products.
4.2 Comparison with Alternative Neural Model Architectures
The ARMDN model is composed of three modular subparts; i) MLP as the associative layer, ii) LSTM as the recurrent layer, and iii) MDN as the output layer. As stated above, the rationale for this modular architecture was that the MLP is useful for the associative features affecting demand, the LSTM handles the sequential aspects of demand while the MDN is necessary to handle the multimodality of the output space. The question naturally arises whether, the hypothesis is in fact correct and whether all three subparts are integral to the performance of the machine. Further, how much is performance affected without a particular subpart.
To quantify the above a series of experiments were performed by removing one component at a time from the ARMDN model. Therefore we developed a RMDN (no MLP), AMDN (no LSTM), AR (single Gaussian) and compared against the complete ARMDN model. Existing deep learning models for time series predictions are based purely on a LSTM with neither the mixture of Gaussian or the multilayer associative model [11].
Figure 8 compares the error of ARMDN with these alternative architectures for the weeks 1518 test window (the same trends holds for other test windows as well). From these experiments we can make the following observations:

We clearly see that the complete ARMDN machine outperforms all its variants by varying degrees.

A model without the associative layer (RMDN) has between 1% to 7% higher error than full ARMDN. The associative layer has a greater role in forecasts further out in the future than in the next week.

The recurrent layer also has an important role in reducing error and its removal causes error to increase from 0.5% to 8%.

The mixture of ten Gaussians in the output layer is significantly better than a single Gaussian across all four forecast horizons.
Overall, the experiments prove that each of the layers of the hybrid ARMDN architecture is integral to the performance and the layers working in tandem unleash the full modeling prowess of the machine.
4.3 Effect of Engineered Features
Deep learning promises to remove the need for feature engineering [20], and on many speech and image processing tasks that promise has been largely met [13, 30]. In these experiments we show that for our environment of highly heterogeneous features and products, features engineered by domain experts continue to be extremely useful.
In Section 3.1.1 we described a number of derived functions. These features are used to capture inherent and easily derivable properties in the time series and price and discounting related observations. In Figure 9 we plot error with and without our hand engineered features. We find that these features had a significant impact in reducing error. These features are particularly useful for predictions further into the future where the recurrent features tend to be less useful.
4.4 Performance During “Event” Weeks
Flipkart tends to hold small to large scale events frequently, spanning multiple days, which are characterized by larger than usual discounts, aggressive marketing campaigns and strong wordofmouth advertising. The “event” weeks also tend to have higher number of unknowns or latent variables like regional festivals, matching competitor events, launch of a popular and coveted product, etc. Importantly, these weeks are also a time for maximizing opportunity (high sale numbers) and when the supplychain is most stressed due to the higher volumes. Business planning (for example, inventory buying) for these events also happen earlier than usual, generally 24 weeks in advance. Therefore, error in forecasts (which need to be more than 1weekout) during the event weeks, disproportionately affects business critical metrics, as compared to ”businessasusual” weeks. In our test windows, weeks 11 (3weeksout in test window 1), 14 (2weeksout in test window 2), and 18 (4weeksout in test window 3) contain such events.
In the comparison with Cubist in Table 2, it is amply clear that ARMDN is the better of the two models for all 3 weeks that contain a major sales event. The average wMAPE for these 3 weeks are and for ARMDN and Cubist, respectively. Given the volume of sales that are observed on these weeks, this difference in performance alone is sufficient motivation for the production scale use of ARMDN.
In the comparison with alternative neural architectures, the insights obtained are even more interesting and clearly highlights the potency of the hybrid architecture deployed in ARMDN. The snapshot provided in Figure 8 is for weeks 1518 window of the test set. Here the event week is 4weeksout and represented by the time point in the Figure. RMDN is the worst performing of the alternatives, and all the other models which do not omit the associative layer beats the vanilla LSTM model. This reinforces the intuitive belief that the effect of the associative factors are largest during the event weeks. Another interesting observation is that though both AR and AMDN retain the associative layer, the models with MDN outperforms the one with the single Gaussian. As mentioned above, the event weeks are also peppered with latent variables and the noise tolerance brought about by the MDN has clear advantages. The complete ARMDN machine significantly outperforms all the other architectures including AMDN, highlighting that even during weeks where associative factors hold greater importance, the additional temporal features captured by the recurrent layer adds value. The same observations also hold true for weeks 11 and 14 in the other test windows.
5 Conclusions and Future Directions
In this work, we develop a unified ML architecture that can simultaneously model causal factors, timeseries trends and multimodal output spaces. The machine can be trained endtoend, in reasonable time over a large realworld dataset. Results show that the proposed architecture easily outperforms a stateoftheart model based on boosted decision trees.
Deep Learning based solutions have been found to be adept at automatic feature learning for structured data such as Images and Speech. However, for problems over multimodal data that spans numerical, categorical, binary and timeseries modalities, Deep Learning benefits heavily from socalled “expertdesigned” features. Similar observation was also made previously in [8].
One of the key aspects to evaluate in the future, is the effect of prediction errors on metrics such as logistics cost, SLA adherence and overall customer satisfaction. Modeling and evaluating these aspects, while interesting, is beyond the scope of this work.
Future work shall focus on the forecasting at different granularities of Product Geography Time. It would be interesting to examine if the solution we proposed in this work, could apply directly to other granularities; or if it would require nontrivial redesign of the architecture. Similarly interesting exploration would be in the reuse of a learned model at a certain granularity to initialize the model for a different granularity.
Footnotes
 In general, different products might have different lengths of historical data. We handle these differences in our model but assume for simplicity of writing that all times series are of the same length.
References
 Winning Model Documentation for Rossmann Store Sales at: https://kaggle2.blob.core.windows.net/forummessageattachments/102102/3454/Rossmann_nr1_doc.pdf.
 T. Appelhans, E. Mwangomo, D. R. Hardy, A. Hemp, and T. Nauss. Evaluating machine learning approaches for the interpolation of monthly air temperature at Mt. Kilimanjaro, Tanzania. Spatial Statistics, 14:91 – 113, 2015.
 V. Bjorn. Multiresolution methods for financial time series prediction. Proceedings of the IEEE/IAFE 1995 Conference on Computational Intelligence for Financial Engineering, page 97, 1995.
 G. Box and D. Cox. An analysis of transformations. Journal of Royal Statistical Society. Series B (Methodological), 26(2):211–252, 1964.
 G. Box and G. M. Jenkins. Some recent advances in forecasting and control. Journal of Royal Statistical Society. Series C (Applied Statistics), 17(2):91–109, 1968.
 E. Busseti, I. Osband, and S. Wong. Deep learning for time series modeling. Technical Report, Stanford University, 2012.
 D.A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). CoRR, abs/1511.07289, 2015.
 P. Covington, J. Adams, and E. Sargin. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys ’16, pages 191–198, 2016.
 R. Cristi and M. Tummula. Multirate, multiresolution, recursive kalman filter. Signal processing, 80:1945–1958, 2000.
 K. J. Ferreira, B. H. A. Lee, and D. SimchiLevi. Analytics for and online retailer: Demand forecasting and price optimization. Manufacturing and Service Operations Management, 18(1):69–88, 2015.
 V. Flunkert, D. Salinas, and J. Gasthaus. DeepAR: Probabilistic forecasting with autoregressive neural networks. arXiv:1704.04110, 2017.
 I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
 G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
 R. Hyndman, A. Koehler, J. Ord, and R. Snyder. Forecasting with exponential smoothing: The state space approach. Springer, 2008.
 R. Kleinberg and T. Leighton. The value of knowing a demand curve: Bounds on regret for online postedprice auctions. pages 594–605, 2003.
 N. Kourentzes. Intermittent demand forecasts with neural networks. International Journal of Production Economics, 143(1):198 – 206, 2013.
 M. Kuhn, S. Weston, C. Keefer, and N. Woulton. Cubist: Rule and instance based regression modeling. R Package Version 0.0.13, CRAN, 2013.
 T. Kuremoto, S. Kimura, K. Kobayashi, and M. Obayashi. Time series forecasting using a deep belief network with restricted Boltzmann machines. Neurocomputing, 137:47 – 56, 2014.
 P. D. Larson, D. SimchiLevi, P. Kaminsky, and E. SimchiLevi. Designing and manging the supply chain. Journal of Business Logistics, 22(1):259–261, 2001.
 Y. LeCun, Y. Bengio, and G. E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 C. M. Bishop. Mixture density networks. 01 1994.
 H. Meyer, M. Katurji, T. Appelhans, M. Muller, T. Nauss, P. Roudier, and P. ZawarReza. Mapping daily air temperature for Antarctica based on MODIS LST. Remote Sensing, 8:732, 2016.
 E. Mocanu, P. H. Nguyen, M. Gibescu, E. M. Larsen, and P. Pinson. Demand forecasting at low aggregation levels using factored conditional restricted boltzmann machine. 2016 Power Systems Computation Conference (PSCC), pages 1–7, 2016.
 J. Moody and W. Lizhong. What is true price? state space models for high frequency fx data. Proceedings of the IEEE/IAFE 1997 Conference on Computational Intelligence for Financial Engineering, pages 150–156, 1997.
 K. Papagiannaki, N. Taft, Z. L. Zhang, and C. Diot. Longterm forecasting of internet backbone traffic: observations and initial models. IEEE Transactions on Neural Networks, 2:178–1188, 2003.
 X. Qiu, L. Zhang, Y. Ren, P. N. Suganthan, and G. Amaratunga. Ensemble deep learning for regression and time series forecasting. 2014 IEEE Symposium on Computational Intelligence in Ensemble Learning (CIEL), pages 1–6, 2014.
 J. Quinlan. Learning with continuous classes. Proceedings of the 5th Australian Joint Conference on Artificial Intelligence, pages 343–348, 1992.
 J. Quinlan. C4.5:programs for machine learning. Morgan Kaufmann Publishers Inc, pages 236–243, 1993.
 J. Quinlan. Combining instance based and model based learning. Proceedings of the 10th International Conference on Machine learning, pages 236–243, 1993.
 A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features offtheshelf: An astounding baseline for recognition. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW ’14, pages 512–519, 2014.
 O. Renaud, J.L. Starck, and F. Murtagh. Waveletbased forecasting of short and long memory time series. Institut d’Economie et EconomÃ©trie, UniversitÃ© de GenÃ¨ve, 2002.
 X. Shi, Z. Chen, H. Wang, D.Y. Yeung, W. kin Wong, and W. chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS), 1:802–810, 2015.
 S. Soltani, D. Boichu, P. Simard, and S. Canu. The longterm memory prediction by multiscale decomposition. Signal Processing, 80:2195–2205, 2000.
 C. Stolojescu, I. Railean, S. Moga, P. Lenca, and A. Isar. A wavelet based prediction method for time series. Proceedings of Stochastic Modeling Techniques and Data Analysis (SMTDA), pages 757–764, 2010.
 L. Torgo. Functional models for regression tree leaves. ICML, pages 385–393, 1997.
 J. Walton. Subpixel urban land cover estimation: Comparing cubist, random forests and support vector regression. Photogram. Eng. Remote Sens., 74:1213–1222, 2008.
 B. Wang, C. Oldham, and M. R. Hipsey. Comparison of machine learning techniques and variables for groundwater dissolved organic nitrogen prediction in an urban area. Procedia Engineering, 154:1176 – 1184, 2016.
 X. Wang and X. Shan. A wavelet based method to predict internet traffic. Communications, Circuits and Systems and West Sino Expositions, 1:690–694, 2002.
 H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 872–3876, 2014.
 B.L. Zhang, R. Coggins, M. Jabri, D. Dersch, and B.Flower. Multiresolution forecasting for futures trading using wavelet decompositions. IEEE Transactions on Neural Networks, 12(4):765–775, 2001.
 H. Zhang, F. Zhang, M. Ye, T. Che, and G. Zhang. Estimating daily air temperatures over the Tibetan Plateau by dynamically integrating MODIS LST. Journal of Geophysical Research: Atmospheres, 121(19):425–11,441, 2016.
 G. Zheng, J. Starck, J. Campbell, and F. Murtagh. The wavelet transform for filtering financial data streams. Journal of Computational Intelligence in Finance, 7:18–35, 1999.