Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting
Abstract.
Multi-horizon forecasting problems often contain a complex mix of inputs – including static (i.e. time-invariant) covariates, known future inputs, and other exogenous time series that are only observed historically – without any prior information on how they interact with the target. While several deep learning models have been proposed for multi-step prediction, they typically comprise black-box models which do not account for the full range of inputs present in common scenarios. In this paper, we introduce the Temporal Fusion Transformer (TFT) – a novel attention-based architecture which combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics. To learn temporal relationships at different scales, the TFT utilizes recurrent layers for local processing and interpretable self-attention layers for learning long-term dependencies. The TFT also uses specialized components for the judicious selection of relevant features and a series of gating layers to suppress unnecessary components, enabling high performance in a wide range of regimes. On a variety of real-world datasets, we demonstrate significant performance improvements over existing benchmarks, and showcase three practical interpretability use-cases of TFT.
1. Introduction
Multi-horizon forecasting, i.e. the prediction of variables-of-interest at multiple future time steps, is a crucial aspect of machine learning for time series data. In contrast to one-step-ahead predictions, multi-horizon forecasting provides decision makers access to estimates across the entire path, allowing them to optimize their course of action at multiple steps in the future. One common aspect of major forecasting scenarios is the availability of different data sources – including known information about the future (e.g. holiday dates), other exogenous time series, and static metadata – without any prior knowledge on how they interact. As such, the identification of key drivers of predictions can be important for decision makers, providing additional insights into temporal dynamics. For instance, static (i.e. time-invariant) covariates often play a key role – such as in healthcare where genetic information can determine the expression of a disease (Stessman HA, 2014). Given the numerous real-world applications of multi-horizon forecasting, e.g. in retail (Böse et al., 2017; Courty and Li, 1999), medicine (Lim et al., 2018; Zhang and Nawata, 2018) and economics (Capistran et al., 2010), improvements in existing methods bear much significance for practitioners in many domains.
Deep neural networks have increasingly been used in multi-horizon time series forecasting, demonstrating strong performance improvements over traditional time-series models (Rangapuram et al., 2018; Alaa and van der Schaar, 2019; Makridakis et al., 2020). While many architectures have focused on recurrent neural network designs (Flunkert et al., 2017; Rangapuram et al., 2018; Wen et al., 2017), recent improvements have considered the use of attention-based methods to enhance the selection of relevant time-steps in the past (Fan et al., 2019) – including Transformer-based models in (Li et al., 2019). However, these methods often fail to consider all different types of inputs commonly present in multi-horizon prediction problems, either assuming that all exogenous inputs are known into the future (Flunkert et al., 2017; Rangapuram et al., 2018; Li et al., 2019) – a common problem with autoregressive models – or neglecting important static covariates (Wen et al., 2017) – which are simply concatenated with other time-dependent features at each step. With many improvements in time-series models resulting from the alignment of architectures with unique data characteristics (Koutník et al., 2014; Neil et al., 2016), similar performance gains could also be reaped by designing networks with suitable inductive biases for multi-horizon forecasting.
Most multi-horizon prediction architectures are ‘black-box’ models, where forecasts are controlled by complex non-linear interactions between many parameters, which makes explainability challenging. In turn, poor interpretability can make it difficult for model builders to improve model quality, for business decision makers to trust a model’s outputs, or for customers to understand the outcomes of a product – due to the lack of insights into what is driving its forecast. Moreover, commonly-used methods for interpretability in deep neural networks can be further limited in time series settings. Conventional post-hoc explainability methods (e.g. LIME (Ribeiro et al., 2016) and SHAP (Lundberg and Lee, 2017)), for example, typically do not consider the time ordering of input features – with surrogate models independently constructed for each data-point or with features assumed to be independent of others (including those at neighboring time steps). This can potentially lead to poor quality explanations for time series data, where dependencies between time steps are typically significant. In addition, attention-based models, such as the Transformer architecture (Vaswani et al., 2017), only provide insights on the relevant time-steps in their conventional form, but not into important features.
In this paper, we propose the Temporal Fusion Transformer (TFT) – an attention-based architecture which combines high-performance multi-horizon forecasting with interpretable insights. For performance improvements over state-of-the-art benchmarks, we introduce several novel adjustments to align the architecture with the full range of potential inputs and temporal relationships common to multi-horizon forecasting – specifically incorporating (1) static covariate encoders which encode context vectors for use in other parts of the network, (2) gating mechanisms throughout and sample-dependent variable selection to minimize the contributions of irrelevant inputs, (3) a sequence-to-sequence layer to locally process known and observed inputs, and (4) a temporal self-attention decoder to learn any long-term dependencies present within the dataset. The use of specialized components also facilitates interpretability, for which we propose three use cases: to identify (i) globally-important variables for the prediction problem, (ii) persistent temporal patterns, and (iii) significant events. On real-world data, we show how these methods can be practically applied and the insights they bring.
2. Related Works
Deep Learning Models for Multi-horizon Forecasting In line with traditional methods for multi-horizon forecasting (Taieb et al., 2010; Marcellino et al., 2006), recent deep learning methods can be categorized into iterated approaches using autoregressive models (Flunkert et al., 2017; Rangapuram et al., 2018; Li et al., 2019) or direct methods using sequence-to-sequence models (Wen et al., 2017; Fan et al., 2019).
Iterated approaches utilize one-step-ahead prediction models, with multi-step predictions obtained by recursively feeding predictions into future inputs. For instance, approaches with Long Short-term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) networks have been considered – such as DeepAR models (Flunkert et al., 2017) which use 3 stacked LSTM layers to generate parameters of one-step-ahead Gaussian predictive distributions. Deep State-Space Models (DSSM) (Rangapuram et al., 2018) adopt a similar approach, utilizing LSTMs to generate parameters of a predefined linear state-space model with predictive distributions produced via Kalman filtering – with extensions for multivariate time series data in (Wang et al., 2019). More recently, Transformer-based architectures have been explored in (Li et al., 2019), which propose the use of convolutional layers for local processing, and a sparse attention mechanism to increase the size of the receptive field during forecasting. Despite their simplicity, iterative methods rely on the assumption that the values of all variables excluding the target are known at forecast time – such that only the target needs to be recursively fed into future inputs. However, in many practical scenarios, numerous useful time-varying inputs exist. As they are unknown in advance, their straightforward use is limited for iterative approaches. TFTs, on the other hand, explicitly account for the diversity of inputs – naturally handling static covariates and known/unknown time-varying inputs.
Direct methods are trained to explicitly generate forecasts for multiple predefined horizons at each time step. Their architectures typically rely on sequence-to-sequence models, using LSTM encoders to summarize historical inputs, and a variety of methods to generate future predictions. The Multi-horizon Quantile Recurrent Forecaster (MQRNN) (Wen et al., 2017), for example, utilizes LSTM or convolutional encoders to generate context vectors, which are fed into multi-layer perceptrons (MLPs) for each horizon. In (Fan et al., 2019) a multi-modal attention mechanism is used with LSTM encoders to construct context vectors for a bi-directional LSTM decoder. Despite performance gains over LSTM-based iterative methods, interpretability is still not straightforward with standard direct methods. In contrast, we show 3 use-cases for interpreting attention patterns in TFTs to produce general insights about temporal dynamics, while maintaining state-of-the-art performance on a variety of datasets.
Time Series Interpretability with Attention Weights Attention mechanisms are used in translation (Vaswani et al., 2017), image classification (Wang et al., 2017) or tabular learning (Arik and Pfister, 2019) to identify salient portions for a specific example – using the magnitude of attention weights to determine the importance of different locations. Recent papers have also proposed the use of attention-based mechanisms for time-series interpretability (Alaa and van der Schaar, 2019; Li et al., 2019; Choi et al., 2016), with both LSTM-based (Song et al., 2018) and Transformer-based (Li et al., 2019) models. However, the importance of static covariates – which may be applicable across all time-steps – may be lost with temporal importance, as these methods typically blend variables at each input. TFT alleviates this by using separate encoder-decoder attention for static features at each time step, on top of the self-attention that determines the contribution of time-varying inputs.
Instance-wise Variable Importance with Deep Neural Networks Instance-wise variable importance can be obtained with post-hoc explanation methods (Ribeiro et al., 2016; Lundberg and Lee, 2017; Yoon et al., 2019) and inherently-interpretable models (Guo et al., 2019; Choi et al., 2016). Post-hoc explanation methods, such as LIME (Ribeiro et al., 2016), SHAP (Lundberg and Lee, 2017) and RL-LIM (Yoon et al., 2019), are applied on pre-trained black-box models. They are often based on distilling into a surrogate interpretable model, or decomposing into feature attributions. These methods are not designed to take into account the time-wise ordering of inputs – i.e. they ignore sequential dependencies between inputs – making it challenging to directly apply them to complex time series data. Inherently-interpretable model design approaches build components for feature selection directly into the architecture itself. For time-series forecasting specifically, they are based on explicitly quantifying time-dependent variable contributions. For example, Interpretable Multi-Variable LSTMs (Guo et al., 2019) partition the hidden state such that each variable contributes uniquely to its own memory segment, and weight memory segments to determine variable contributions. Methods combining temporal importance and variable selection have also been considered in (Choi et al., 2016), which computes a single contribution coefficient based on attention weights from each. However, in addition to modelling only one-step-ahead prediction problems, existing methods also focus on instance-specific interpretations of attention weights – without providing insights into general temporal dynamics. In contrast, the use-cases in Section 7 demonstrate the capability of TFT in analyzing global temporal relationships to build insights about the data as a whole.
3. Multi-horizon Forecasting
The general problem of multi-horizon forecasting is depicted in Fig. 1. Let there be $I$ unique entities in a given time series dataset – such as different stores for retail forecasting or patients in the medical context. Each entity $i$ is associated with a set of static covariates $s_i \in \mathbb{R}^{m_s}$, as well as inputs $\chi_{i,t} \in \mathbb{R}^{m_\chi}$ and scalar targets $y_{i,t} \in \mathbb{R}$ at each time-step $t \in [0, T_i]$. In a general sense, time-dependent input features are subdivided into two categories $\chi_{i,t} = [z_{i,t}^T, x_{i,t}^T]^T$ – i.e. observed inputs $z_{i,t} \in \mathbb{R}^{m_z}$ which can only be measured at each step and are unknown beforehand, and known inputs $x_{i,t} \in \mathbb{R}^{m_x}$ which can be predetermined (e.g. the day-of-week at time $t$).
In many scenarios, the provision of prediction intervals can be useful for optimizing business decisions and risk management, by giving decision makers an indication of likely best and worst-case values that the target can take. As such, we adopt quantile regression in our multi-horizon forecasting setting (e.g. outputting the 10th, 50th and 90th percentiles at each time step). Each quantile forecast takes the form:
(1) $\hat{y}_i(q, t, \tau) = f_q\left(\tau,\ y_{i,t-k:t},\ z_{i,t-k:t},\ x_{i,t-k:t+\tau},\ s_i\right)$
where $\hat{y}_i(q, t, \tau)$ is the predicted $q$-th sample quantile of the $\tau$-step-ahead forecast at time $t$, and $f_q(\cdot)$ is a prediction model. In line with other direct methods, our model simultaneously outputs forecasts for $\tau_{max}$ discrete prediction horizons – i.e. $\tau \in \{1, \dots, \tau_{max}\}$. We incorporate all past information within a finite look-back window $k$, using target and observed inputs only up till and including the forecast start time $t$ (i.e. $y_{i,t-k:t} = \{y_{i,t-k}, \dots, y_{i,t}\}$) and known inputs across the entire range (i.e. $x_{i,t-k:t+\tau} = \{x_{i,t-k}, \dots, x_{i,t}, \dots, x_{i,t+\tau}\}$). For notational simplicity, we omit the subscript $i$ throughout the paper unless explicitly required.
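As an illustration of the input ranges described above, the following minimal sketch slices one example at a forecast start time: targets and observed inputs over the look-back window only, known inputs over the look-back window plus all forecast horizons, and the future targets used as labels. The helper `make_window` and the toy arrays are hypothetical and are not taken from the original implementation.

```python
import numpy as np

def make_window(y, z, x, t, k, tau_max):
    """Slice one example at forecast start time t (hypothetical helper)."""
    if t - k < 0 or t + tau_max >= len(y):
        raise ValueError("window falls outside the series")
    past_y = y[t - k : t + 1]               # targets over the look-back window
    past_z = z[t - k : t + 1]               # observed inputs, known only up to t
    known_x = x[t - k : t + tau_max + 1]    # known inputs, past and future
    future_y = y[t + 1 : t + tau_max + 1]   # labels for horizons 1..tau_max
    return past_y, past_z, known_x, future_y

T = 20
y = np.arange(T, dtype=float)                        # toy target series
z = np.random.default_rng(0).normal(size=(T, 2))     # toy observed inputs
x = (np.arange(T) % 7).reshape(-1, 1)                # toy known input (day-of-week)

past_y, past_z, known_x, future_y = make_window(y, z, x, t=10, k=5, tau_max=3)
```

Note that only `known_x` extends beyond the forecast start time, which is what separates this setting from iterative approaches that assume every input is available in the future.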
4. Model Architecture
We design the Temporal Fusion Transformer (TFT) to use canonical components to efficiently build feature representations for each input type (i.e. static, known inputs, observed inputs), enabling it to obtain high forecasting performance on a wide range of problems. The major constituents of the TFT are:

Gating Mechanisms – to skip over any unused components of the architecture, providing adaptive depth and network complexity to accommodate a wide range of datasets and scenarios. Gated Linear Units are used extensively throughout our architecture, and the Gated Residual Network is proposed as the main building block.

Variable Selection Networks – to select relevant input variables at each time step.

Static Covariate Encoders – to integrate static features into the network, through encoding of context vectors to condition temporal dynamics.

Temporal Processing – to learn both long- and short-term temporal relationships, while naturally handling both observed and a priori known time-varying inputs. A sequence-to-sequence layer is employed for local feature processing, whereas long-term dependencies are captured using a novel interpretable multi-head attention block.

Multi-Horizon Forecast Intervals Prediction – to yield quantile forecasts produced at each prediction horizon.
Fig. 2 shows the high-level architecture of the TFT, with individual components described in detail in the subsequent sections. An open-source implementation of the model can also be found on GitHub.
4.1. Gating Mechanisms
As previously highlighted, the precise relationship between exogenous inputs and targets is often unknown in advance, making it difficult to anticipate which variables are relevant. Moreover, it makes it difficult to determine the extent of nonlinear processing required, and there may be instances where simpler models can be beneficial – e.g. when datasets are small or noisy.
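To apply non-linear processing only where needed, we introduce the Gated Residual Network (GRN) as a basic building block of TFT, as shown in Fig. 2. At the lowest level, the GRN takes in a primary input $a$ and an optional context vector $c$ and yields:
(2) $\text{GRN}_\omega(a, c) = \text{LayerNorm}\left(a + \text{GLU}_\omega(\eta_1)\right)$
(3) $\eta_1 = W_{1,\omega}\,\eta_2 + b_{1,\omega}$
(4) $\eta_2 = \text{ELU}\left(W_{2,\omega}\,a + W_{3,\omega}\,c + b_{2,\omega}\right)$
where ELU is the Exponential Linear Unit activation function (Clevert et al., 2016), $\eta_1, \eta_2 \in \mathbb{R}^{d_{model}}$ are intermediate layers, LayerNorm is standard layer normalization of (Lei Ba et al., 2016), and $\omega$ is an index used to denote how weights are shared. We adopt component gating layers based on Gated Linear Units (GLUs) (Dauphin et al., 2017) to provide the flexibility to suppress any parts of the architecture that are not required for a given dataset. Letting $\gamma \in \mathbb{R}^{d_{model}}$ be the input, the GLU then takes the form:
(5) $\text{GLU}_\omega(\gamma) = \sigma\left(W_{4,\omega}\,\gamma + b_{4,\omega}\right) \odot \left(W_{5,\omega}\,\gamma + b_{5,\omega}\right)$
where $\sigma(\cdot)$ is the sigmoid activation function, $W_{(\cdot)} \in \mathbb{R}^{d_{model} \times d_{model}}$, $b_{(\cdot)} \in \mathbb{R}^{d_{model}}$ are the weights and biases, $\odot$ is the element-wise Hadamard product, and $d_{model}$ is the hidden state size (common across the TFT). The GLU allows the TFT to control the extent to which the GRN contributes to the original input $a$ – potentially skipping over the layer entirely if necessary, since GLU outputs close to 0 suppress the non-linear contribution. For instances without a context vector, the GRN simply treats the context input as zero – i.e. $c = 0$ in Eq. (4). During training, dropout is applied before the gating layer and layer normalization – i.e. to $\eta_1$ in Eq. (3).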
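The gating behaviour can be made concrete with a small NumPy sketch. It assumes the GRN has the form LayerNorm(a + GLU(η1)), with η1 a linear map of η2 and η2 = ELU of a linear map of the input and context, and a GLU given by a sigmoid gate multiplied element-wise with a linear map. The parameter dictionary, the row-vector convention (`x @ W`), and the omission of dropout and of the learned LayerNorm scale and shift are simplifications made for illustration only.

```python
import numpy as np

def elu(x):
    # Exponential Linear Unit; the minimum avoids overflow for large positive x
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-6):
    # Normalization over the feature axis, without learned scale and shift
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def glu(gamma, W4, b4, W5, b5):
    # Sigmoid gate multiplied element-wise with a linear transformation
    return sigmoid(gamma @ W4 + b4) * (gamma @ W5 + b5)

def grn(a, p, c=None):
    # A missing context vector is treated as zero
    if c is None:
        c = np.zeros_like(a)
    eta2 = elu(a @ p["W2"] + c @ p["W3"] + p["b2"])
    eta1 = eta2 @ p["W1"] + p["b1"]
    return layer_norm(a + glu(eta1, p["W4"], p["b4"], p["W5"], p["b5"]))

rng = np.random.default_rng(0)
d = 4  # hidden size, chosen arbitrarily
params = {n: rng.normal(scale=0.5, size=(d, d)) for n in ("W1", "W2", "W3", "W4", "W5")}
params.update({n: np.zeros(d) for n in ("b1", "b2", "b4", "b5")})
a = rng.normal(size=(3, d))
out = grn(a, params)

# Forcing the gate shut makes the block reduce to a normalized skip connection
closed = dict(params, W4=np.zeros((d, d)), b4=np.full(d, -50.0))
skipped = grn(a, closed)
```

With the gate driven towards zero, `skipped` matches `layer_norm(a)`, which is the sense in which the network can skip non-linear processing where it is not needed.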
4.2. Variable Selection Networks
While multiple variables may be available, their relevance and specific contribution to the output are typically unknown. The TFT is designed to provide instance-wise variable selection – through the use of variable selection networks applied to both static covariates and time-dependent covariates. Beyond providing insights into which variables are most significant for the prediction problem, variable selection also allows the TFT to remove any unnecessary noisy inputs which could negatively impact performance.
We use entity embeddings (Gal and Ghahramani, 2016) for categorical variables and linear transformations for continuous variables, to transform each input variable into a $d_{model}$-dimensional vector – matching the dimensions in subsequent layers for skip connections. In addition, all static, past and future inputs make use of separate variable selection networks, as denoted by different colors in Fig. 2. Without loss of generality, we present the variable selection network for historical inputs – noting that those for other inputs take the same form.
Let $\xi_t^{(j)} \in \mathbb{R}^{d_{model}}$ denote the transformed input of the $j$-th variable at time $t$, with $\Xi_t = \left[\xi_t^{(1)T}, \dots, \xi_t^{(m_\chi)T}\right]^T$ being the flattened vector of all historical inputs at time $t$. Variable selection weights are generated by feeding both $\Xi_t$ and an external context vector $c_s$ through a GRN, followed by a Softmax layer:
(6) $v_{\chi_t} = \text{Softmax}\left(\text{GRN}_{v_\chi}(\Xi_t, c_s)\right)$
where $v_{\chi_t} \in \mathbb{R}^{m_\chi}$ is a vector of variable selection weights, and $c_s$ is obtained from a static covariate encoder (see Section 4.3). For static variables, we note that the context vector $c_s$ is omitted – given that the network already has access to static information.
At each time step, an additional layer of non-linear processing is employed by feeding each $\xi_t^{(j)}$ through its own GRN:
(7) $\tilde{\xi}_t^{(j)} = \text{GRN}_{\tilde{\xi}(j)}\left(\xi_t^{(j)}\right)$
where $\tilde{\xi}_t^{(j)}$ is the processed feature vector for variable $j$. We note that each variable has its own $\text{GRN}_{\tilde{\xi}(j)}$, with weights shared across all time steps $t$. Processed features are then weighted by their variable selection weights and combined as below:
(8) $\tilde{\xi}_t = \sum_{j=1}^{m_\chi} v_{\chi_t}^{(j)}\, \tilde{\xi}_t^{(j)}$
where $v_{\chi_t}^{(j)}$ is the $j$-th element of vector $v_{\chi_t}$.
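The selection and combination steps can be sketched for a single time step as follows. To keep the example short, a single `tanh` layer stands in for each GRN; this is an assumption made purely for illustration, as are all weight names and sizes. Only the structure matters here: a softmax over per-variable logits, followed by a weighted sum of per-variable processed features.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
m, d = 3, 4                          # number of variables and hidden size (arbitrary)
xi = rng.normal(size=(m, d))         # one embedded vector per input variable
c_s = rng.normal(size=d)             # static context vector

# Stand-in for the selection GRN: flattened inputs plus context -> m logits
W_flat = rng.normal(size=(m * d, m))
W_ctx = rng.normal(size=(d, m))
v = softmax(np.tanh(xi.reshape(-1) @ W_flat + c_s @ W_ctx))   # selection weights

# Stand-in for the per-variable GRNs: each variable has its own weight matrix
W_var = rng.normal(size=(m, d, d))
xi_tilde = np.tanh(np.einsum("jd,jde->je", xi, W_var))        # processed features

# Weighted combination of the processed features
combined = (v[:, None] * xi_tilde).sum(axis=0)
```

Because the weights `v` are non-negative and sum to one, they can be read directly as the relative contribution of each variable at that time step, which is what the interpretability analysis later relies on.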
4.3. Static Covariate Encoders
To build complex representations of static metadata, we use four separate GRN encoders to produce four different context vectors, $c_s$, $c_e$, $c_c$ and $c_h$. These are then wired into various locations in the temporal fusion decoder (Section 4.5) where static variables play an important role in processing. Specifically, this includes contexts for (1) temporal variable selection ($c_s$), (2) local processing of temporal features ($c_c$, $c_h$), and (3) enriching of temporal features with static information ($c_e$). As an example, taking $\zeta$ to be the output of the static variable selection network, contexts for temporal variable selection would be encoded according to $c_s = \text{GRN}_{c_s}(\zeta)$.
4.4. Interpretable Multi-Head Attention
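To learn long-term relationships across different time steps, TFT employs a self-attention mechanism. In a broad sense, attention mechanisms scale values $V \in \mathbb{R}^{N \times d_V}$ based on relationships between keys $K \in \mathbb{R}^{N \times d_{attn}}$ and queries $Q \in \mathbb{R}^{N \times d_{attn}}$ as below:
(9) $\text{Attention}(Q, K, V) = A(Q, K)\, V$
where $A(\cdot)$ is a normalization function. A common choice is scaled dot-product attention (Vaswani et al., 2017):
(10) $A(Q, K) = \text{Softmax}\left(Q K^T / \sqrt{d_{attn}}\right)$
In the canonical form used in the Transformer (Vaswani et al., 2017), multi-head attention uses different heads to attend to different representation subspaces, with each head applying the mechanism of Eq. (9):
(11) $\text{MultiHead}(Q, K, V) = \left[H_1, \dots, H_{m_H}\right] W_H$
(12) $H_h = \text{Attention}\left(Q\, W_Q^{(h)},\ K\, W_K^{(h)},\ V\, W_V^{(h)}\right)$
where $W_K^{(h)} \in \mathbb{R}^{d_{model} \times d_{attn}}$, $W_Q^{(h)} \in \mathbb{R}^{d_{model} \times d_{attn}}$, $W_V^{(h)} \in \mathbb{R}^{d_{model} \times d_V}$ are head-specific weights for keys, queries and values, and $W_H \in \mathbb{R}^{(m_H \cdot d_V) \times d_{model}}$ linearly combines outputs concatenated from all heads $H_h$.
Given that different values are used in each head, analyzing attention weights alone would not be indicative of a particular feature’s overall importance. As such, we modify multi-head attention to share values in each head, and employ additive aggregation of all heads at the output:
(13) $\text{InterpretableMultiHead}(Q, K, V) = \tilde{H}\, W_H$
(14) $\tilde{H} = \tilde{A}(Q, K)\, V\, W_V$
(15) $\phantom{\tilde{H}} = \left\{ \frac{1}{m_H} \sum_{h=1}^{m_H} A\left(Q\, W_Q^{(h)},\ K\, W_K^{(h)}\right) \right\} V\, W_V$
(16) $\phantom{\tilde{H}} = \frac{1}{m_H} \sum_{h=1}^{m_H} \text{Attention}\left(Q\, W_Q^{(h)},\ K\, W_K^{(h)},\ V\, W_V\right)$
where $W_V \in \mathbb{R}^{d_{model} \times d_V}$ are value weights shared across all heads, and $W_H \in \mathbb{R}^{d_V \times d_{model}}$ is used for final linear mapping. From Eq. (15), we see that each head is able to learn different temporal patterns, while attending to a common set of input features – which can be interpreted as a simple ensemble over attention weights into the combined matrix $\tilde{A}(Q, K)$ in Eq. (14). Compared to $A(Q, K)$ in Eq. (10), we can see that $\tilde{A}(Q, K)$ yields an increased representation capacity in an efficient way.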
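A compact NumPy sketch of this shared-value attention is given below. It assumes scaled dot-product attention in each head, head-specific query and key projections, a single value projection shared by all heads, and averaging of the per-head attention matrices before they are applied to the values. The function name, argument layout and the optional boolean `mask` argument (used here for a causal mask) are illustrative choices and not the reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def interpretable_multihead(Q, K, V, W_Q, W_K, W_V, W_H, mask=None):
    """W_Q, W_K: (m_H, d_model, d_attn); W_V: (d_model, d_V); W_H: (d_V, d_model)."""
    d_attn = W_Q.shape[-1]
    q = np.einsum("nd,hda->hna", Q, W_Q)                   # (m_H, N, d_attn)
    k = np.einsum("nd,hda->hna", K, W_K)                   # (m_H, N, d_attn)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_attn)    # (m_H, N, N)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)              # block disallowed positions
    A_tilde = softmax(scores).mean(axis=0)                 # average over heads: (N, N)
    H_tilde = A_tilde @ (V @ W_V)                          # values shared across heads
    return H_tilde @ W_H, A_tilde

rng = np.random.default_rng(0)
N, d_model, m_H = 5, 8, 2
d_attn = d_V = d_model // m_H
X = rng.normal(size=(N, d_model))
W_Q = rng.normal(size=(m_H, d_model, d_attn))
W_K = rng.normal(size=(m_H, d_model, d_attn))
W_V = rng.normal(size=(d_model, d_V))
W_H = rng.normal(size=(d_V, d_model))

# Causal mask: each position may attend to itself and to earlier positions
causal = np.tril(np.ones((N, N), dtype=bool))
out, A_tilde = interpretable_multihead(X, X, X, W_Q, W_K, W_V, W_H, mask=causal)
```

Since every head multiplies the same projected values, the averaged matrix `A_tilde` is a single set of attention weights that fully describes how positions are mixed, which is why it can be inspected directly.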
4.5. Temporal Fusion Decoder
The temporal fusion decoder uses the series of layers described below to learn temporal relationships present in the dataset:
Locality Enhancement with Sequence-to-Sequence Layer
Points of significance in time series data are often identified in relation to their surrounding values – such as anomalies, change-points or cyclical patterns. Leveraging local context, through the construction of features that utilize pattern information on top of point-wise values, can thus lead to performance improvements in attention-based architectures, as also highlighted in (Li et al., 2019). For instance, (Li et al., 2019) adopt a single convolutional layer for locality enhancement – extracting local patterns using the same filter across all time. However, this might not be suitable for cases when observed inputs exist, due to the differing number of past and future inputs. As such, we propose the application of a sequence-to-sequence model to naturally handle these differences – feeding $\tilde{\xi}_{t-k:t}$ into the encoder and $\tilde{\xi}_{t+1:t+\tau_{max}}$ into the decoder. This then generates a set of uniform temporal features which serve as inputs into the temporal fusion decoder itself – denoted by $\phi(t, n) \in \{\phi(t, -k), \dots, \phi(t, \tau_{max})\}$ with $n$ being a position index. For comparability with commonly-used sequence-to-sequence baselines, we consider the use of an LSTM encoder-decoder model – although other models can potentially be adopted as well. This also serves as a replacement for standard positional encoding, providing an appropriate inductive bias for the time ordering of the inputs. Moreover, to allow static metadata to influence local processing, we use the $c_c$, $c_h$ context vectors from the static covariate encoders to initialize the cell state and hidden state respectively for the first LSTM in the layer. We also employ a gated skip connection over this layer:
(17) $\tilde{\phi}(t, n) = \text{LayerNorm}\left(\tilde{\xi}_{t+n} + \text{GLU}_{\tilde{\phi}}\left(\phi(t, n)\right)\right)$
where $n \in [-k, \tau_{max}]$ is a position index.
Static Enrichment Layer
As static covariates often have a significant influence on the temporal dynamics (e.g. genetic information on disease risk), we introduce a static enrichment layer that enhances temporal features with static metadata. For a given position index $n$, static enrichment takes the form:
(18) $\theta(t, n) = \text{GRN}_\theta\left(\tilde{\phi}(t, n),\ c_e\right)$
where the weights of $\text{GRN}_\theta$ are shared across the entire layer, and $c_e$ is a context vector from a static covariate encoder.
Temporal Self-Attention Layer
Following static enrichment, we next apply self-attention to the temporal features. All static-enriched temporal features are first grouped into a single matrix – i.e. $\Theta(t) = \left[\theta(t, -k), \dots, \theta(t, \tau_{max})\right]^T$ – and interpretable multi-head attention (see Section 4.4) is applied at each forecast time (with $N = \tau_{max} + k + 1$):
(19) $B(t) = \text{InterpretableMultiHead}\left(\Theta(t), \Theta(t), \Theta(t)\right)$
to yield $B(t) = \left[\beta(t, -k), \dots, \beta(t, \tau_{max})\right]$. $d_V = d_{attn} = d_{model} / m_H$ are chosen, where $m_H$ is the number of heads. Decoder masking (Vaswani et al., 2017; Li et al., 2019) is applied to the multi-head attention layer to ensure that each temporal dimension can only attend to features preceding it. Besides preserving causal information flow via masking, the self-attention layer allows the TFT to pick up long-range dependencies that may be challenging for RNN-based architectures to learn. Following the self-attention layer, an additional gating layer is also applied to facilitate training:
(20) $\delta(t, n) = \text{LayerNorm}\left(\theta(t, n) + \text{GLU}_\delta\left(\beta(t, n)\right)\right)$
Position-wise Feed-forward Layer
Lastly, we apply additional non-linear processing to the outputs of the self-attention layer. Similar to the static enrichment layer, this makes use of a series of GRNs:
(21) $\psi(t, n) = \text{GRN}_\psi\left(\delta(t, n)\right)$
where the weights of $\text{GRN}_\psi$ are shared across the entire layer. As per Fig. 2, we also apply a gated residual connection which skips over the entire transformer block, providing a direct path to the sequence-to-sequence layer – yielding a simpler model if additional complexity is not required, as shown below:
(22) $\tilde{\psi}(t, n) = \text{LayerNorm}\left(\tilde{\phi}(t, n) + \text{GLU}_{\tilde{\psi}}\left(\psi(t, n)\right)\right)$
4.6. Quantile Outputs
In line with previous work (Wen et al., 2017), the TFT also generates prediction intervals on top of point forecasts. This is achieved by the simultaneous prediction of various percentiles (e.g. 10th, 50th and 90th) at each time step. Quantile forecasts are generated using a linear transformation of the output from the temporal fusion decoder:
(23) $\hat{y}(q, t, \tau) = W_q\, \tilde{\psi}(t, \tau) + b_q$
where $W_q \in \mathbb{R}^{1 \times d_{model}}$, $b_q \in \mathbb{R}$ are linear coefficients for the specified quantile $q$. We note that forecasts are only generated for horizons in the future – i.e. $\tau \in \{1, \dots, \tau_{max}\}$.
5. Training Procedure
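As per (Wen et al., 2017), the TFT is trained by jointly minimizing the quantile loss terms summed across all quantile outputs:
(24) $\mathcal{L}(\Omega, W) = \sum_{y_t \in \Omega} \sum_{q \in \mathcal{Q}} \sum_{\tau=1}^{\tau_{max}} \frac{QL\left(y_t,\ \hat{y}(q, t-\tau, \tau),\ q\right)}{M \tau_{max}}$
(25) $QL(y, \hat{y}, q) = q\,(y - \hat{y})_+ + (1 - q)\,(\hat{y} - y)_+$
where $\Omega$ is the domain of training data containing $M$ samples, $W$ represents the weights of the TFT, $\mathcal{Q}$ is the set of output quantiles, and $(\cdot)_+ = \max(0, \cdot)$. For out-of-sample testing, we evaluate the normalized quantile losses across the entire forecasting horizon – focusing on P50 and P90 risk for consistency with previous work (Flunkert et al., 2017; Rangapuram et al., 2018; Li et al., 2019):
(26) $q\text{-Risk} = \frac{2 \sum_{y_t \in \tilde{\Omega}} \sum_{\tau=1}^{\tau_{max}} QL\left(y_t,\ \hat{y}(q, t-\tau, \tau),\ q\right)}{\sum_{y_t \in \tilde{\Omega}} \sum_{\tau=1}^{\tau_{max}} |y_t|}$
where $\tilde{\Omega}$ is the domain of test samples. Full details on hyperparameter optimization and training can be found in Appendix C.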
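The loss and evaluation metric can be illustrated with a few lines of NumPy. The sketch assumes the standard quantile (pinball) loss, QL(y, ŷ, q) = q·max(y − ŷ, 0) + (1 − q)·max(ŷ − y, 0), a training objective that averages it over samples and horizons while summing over quantiles, and a q-Risk equal to twice the summed quantile loss divided by the summed absolute targets. Function names and the toy arrays are hypothetical.

```python
import numpy as np

def quantile_loss(y, y_hat, q):
    # Under-prediction is weighted by q, over-prediction by (1 - q)
    diff = y - y_hat
    return q * np.maximum(diff, 0.0) + (1.0 - q) * np.maximum(-diff, 0.0)

def joint_loss(y, forecasts):
    """forecasts maps each quantile q to an array shaped like y (samples, horizons)."""
    total = sum(quantile_loss(y, f, q).sum() for q, f in forecasts.items())
    return total / y.size          # y.size = number of samples x number of horizons

def q_risk(y, y_hat, q):
    # Normalized quantile loss over all test samples and horizons
    return 2.0 * quantile_loss(y, y_hat, q).sum() / np.abs(y).sum()

y = np.array([[1.0, 2.0], [3.0, 4.0]])        # toy targets: 2 samples, 2 horizons
y_hat = np.array([[1.5, 1.0], [3.0, 5.0]])    # toy forecasts

p50 = q_risk(y, y_hat, 0.5)
p90 = q_risk(y, y_hat, 0.9)
loss = joint_loss(y, {0.5: y_hat, 0.9: y_hat})
```

For these toy values, P50 risk is 0.25 and P90 risk is 0.21; the P90 figure is lower because the forecasts here mostly over-predict, which the 0.9 quantile penalizes only lightly.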
6. Performance Evaluation


6.1. Datasets
We choose datasets to reflect commonly observed characteristics across a wide range of challenging multi-horizon forecasting problems. To establish a baseline and position with respect to prior academic work, we first evaluate performance on the Electricity and Traffic datasets used in (Flunkert et al., 2017; Rangapuram et al., 2018; Li et al., 2019) – which focus on simpler univariate time series containing known inputs only alongside the target. Next, the Retail dataset helps us benchmark the model using the full range of complex inputs observed in multi-horizon prediction applications (see Section 3) – including static metadata and observed time-varying inputs. Finally, to evaluate robustness to over-fitting on smaller noisy datasets, we consider the financial application of volatility forecasting – using a dataset much smaller than others. Broad descriptions of each dataset can be found below:
Electricity The UCI Electricity Load Diagrams Dataset contains the electricity consumption of 370 customers – aggregated on an hourly level as in (Yu et al., 2016). In accordance with (Flunkert et al., 2017), we use the past week of data (i.e. 168 hours) to forecast consumption over the next day (i.e. 24 hours).
Traffic The UCI PEM-SF Traffic Dataset describes the occupancy rate (with $y_t \in [0, 1]$) of 440 San Francisco Bay Area freeways – as in (Yu et al., 2016). This is also aggregated on an hourly level as per the electricity dataset, with the same look-back window and forecast horizon.
Retail The Favorita Grocery Sales Dataset from the Kaggle competition (Favorita, 2018) combines metadata for different products and stores, along with other exogenous time-varying inputs sampled at the daily level. We forecast log product sales 30 days into the future, using 90 days of historical information.
Volatility The OMI realized library (Heber et al., 2009) contains daily realized volatility values of 31 stock indices computed from intraday data, along with their daily returns. For our experiments, we consider forecasts over the next week (i.e. 5 business days) using information over the past year (i.e. 252 business days).
For each dataset, we partition all time series into 3 sections – a training set for network calibration, a validation set for hyperparameter optimization, and a hold-out test set for performance evaluation. Full details on the feature engineering steps and train/test splits are provided for each dataset in Appendix B.
6.2. Benchmarks
We extensively compare the TFT to a wide range of machine learning models for multi-horizon forecasting, based on the categories described in Section 2. Hyperparameter optimization is conducted using random search over a predefined search space, using the same number of iterations across all benchmarks for a given dataset. Additional details on benchmark model training are also included in Appendix C for reference.
Direct methods As the TFT falls within this class of multi-horizon models, we primarily focus comparisons on deep learning methods which directly generate predictions at future horizons. This specifically includes 1) simple sequence-to-sequence models with global contexts (Seq2Seq), and 2) the Multi-horizon Quantile Recurrent Forecaster (MQRNN) – both of which are described in (Wen et al., 2017).
Iterative methods To position with respect to the rich body of work on iterative models, we evaluate the TFT using the same setup as (Flunkert et al., 2017) for the Electricity and Traffic datasets. This extends the results from (Li et al., 2019) for 1) DeepAR models (Flunkert et al., 2017), 2) Deep State Space Models (DSSM) (Rangapuram et al., 2018), and 3) the Transformer-based architecture of (Li et al., 2019) with local convolutional processing – which we refer to as ConvTrans. For more complex datasets, we focus on the ConvTrans model given its strong outperformance over other iterative models in prior work. As models in this category require knowledge of all inputs in the future to generate predictions, we accommodate this for complex datasets by imputing unknown inputs with their last available value.
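The last-value imputation used for the iterative benchmarks can be sketched as below. The helper name and array layout (time steps along the first axis) are assumptions for illustration; the benchmark code itself is not shown in this paper.

```python
import numpy as np

def impute_last_value(observed_past, tau_max):
    """Extend observed inputs by repeating the final observed row tau_max times."""
    last = observed_past[-1:]                          # keep the leading time axis
    future = np.repeat(last, tau_max, axis=0)          # (tau_max, num_features)
    return np.concatenate([observed_past, future], axis=0)

z_past = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])                       # toy observed inputs up to time t
z_full = impute_last_value(z_past, tau_max=2)
```

This gives an iterative model a value for every future input, but the imputed values carry no new information, which is one plausible reason such models fare worse when observed inputs matter.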
6.3. Results and Discussion
Table 1 shows that the TFT significantly outperforms all benchmarks over the variety of datasets described in Section 6.1. On average, the TFT yields lower P50 and lower P90 losses than the next best model – demonstrating the benefits of explicitly aligning the architecture with the general multi-horizon forecasting problem. Further ablation analyses can be found in Appendix D.
Comparing direct and iterative model performance, we observe the importance of accounting for the observed inputs – noting the poorer results of ConvTrans on complex datasets where observed input imputation is required (i.e. Volatility and Retail). Furthermore, the benefits of quantile regression are also observed when targets are not modelled well by conditional Gaussians, with direct methods outperforming in those scenarios. This can be seen, for example, from the Traffic dataset where the target distribution is significantly skewed – with occupancy rates concentrated between 0 and 0.1, and the remainder distributed evenly until 1.0.
7. Interpretability Use Cases
Having established the performance benefits of our model, we next demonstrate how to analyze components of the TFT to interpret the general relationships it has learned. We demonstrate three interpretability use-cases: 1) examining the importance of each input variable in prediction, 2) visualizing persistent temporal patterns, and 3) identifying any regimes or events that lead to significant changes in temporal dynamics. In contrast with other examples of attention-based interpretability (Song et al., 2018; Li et al., 2019; Alaa and van der Schaar, 2019), which zoom in on interesting but instance-specific examples, we note that our methods focus on ways to aggregate the patterns across the entire dataset – allowing us to extract generalizable insights about temporal dynamics.
7.1. Analyzing Variable Importance
We first quantify variable importance by analyzing the variable selection weights described in Section 4.2. Concretely, we aggregate selection weights (i.e. $v_{\chi_t}^{(j)}$ in Eq. (8)) for each variable across our entire test set, recording the 10th, 50th and 90th percentiles of each sampling distribution. Given its wide range of inputs, we present results on the Retail dataset in Table 2 – with the remainder presented in Appendix E.1. Overall, the TFT focuses on only a subset of key inputs that significantly contribute to predictions.
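The aggregation step can be sketched as follows, assuming the selection weights have been collected into an array with one row per test sample and one column per variable, and assuming the 10th, 50th and 90th percentiles are the summary statistics of interest. The variable names are hypothetical and the weights are synthetic draws, used only so that the example runs; they are not results from any dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
names = ["var_a", "var_b", "var_c"]      # hypothetical variable names

# Synthetic stand-in for collected selection weights: each row sums to one
weights = rng.dirichlet(alpha=[5.0, 1.0, 0.5], size=1000)   # (samples, variables)

# Percentiles of each variable's weight distribution across the test set
pcts = np.percentile(weights, [10, 50, 90], axis=0)         # (3, variables)
summary = {name: tuple(pcts[:, j]) for j, name in enumerate(names)}

# Rank variables by median selection weight, largest first
ranking = sorted(names, key=lambda name: summary[name][1], reverse=True)
```

Reporting a spread of percentiles rather than only a mean shows whether a variable is consistently important or only matters for a minority of samples.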



7.2. Visualizing Persistent Temporal Patterns
The analysis of persistent temporal patterns is often key to understanding the time-dependent relationships present in a given dataset. For instance, lag models are frequently adopted to study the length of time required for an intervention to take effect (Du et al., 2018) – such as the impact of a government’s increase in public expenditure on the resultant growth in Gross National Product (Baltagi, 2008). Seasonality models are also commonly used in econometrics to identify periodic patterns in a target-of-interest (Hylleberg, 1992) and measure the length of each cycle. Using the attention weights present in the self-attention layer of the temporal fusion decoder, we present a method below to identify similar persistent patterns – by measuring the contributions of features at fixed lags in the past on forecasts at various horizons.
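Combining Eq. (14) and (19), we see that the self-attention layer contains a matrix of attention weights at each forecast time $t$ – i.e. $\tilde{A}(\phi(t), \phi(t))$. As such, multi-head attention outputs at each forecast horizon $\tau$ (i.e. $\beta(t, \tau)$) can be described as an attention-weighted sum of lower-level features at each position $n$:
(27) $\beta(t, \tau) = \sum_{n=-k}^{\tau_{\max}} \alpha(t, n, \tau) \, \tilde{\theta}(t, n)$
where $\alpha(t, n, \tau)$ is the $(\tau, n)$-th element of $\tilde{A}(\phi(t), \phi(t))$, and $\tilde{\theta}(t, n)$ is a row of $\tilde{\Theta}(t) = \Theta(t) W_V$. Due to decoder masking, we also note that $\alpha(t, i, j) = 0$, $\forall i > j$. For each forecast horizon $\tau$, the importance of a previous time point $n < \tau$ can hence be determined by analyzing distributions of $\alpha(t, n, \tau)$ across all time steps and entities. We present results for the Traffic dataset below, with similar findings on Electricity and Retail presented in Appendix E.2.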
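The attention-weighted sum described above can be sketched in a few lines. The example below is a minimal illustration for a single forecast position, assuming plain softmax weights over raw scores and a mask that zeroes out every position after the query; the function names and toy values are hypothetical and do not come from the paper's code.

```python
import math
from typing import List


def masked_attention_weights(scores: List[float], query_pos: int) -> List[float]:
    """Softmax over raw scores, with positions after `query_pos` masked out.

    Masked positions receive a weight of exactly 0, mirroring decoder masking.
    """
    visible = scores[: query_pos + 1]
    peak = max(visible)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in visible]
    total = sum(exps)
    weights = [e / total for e in exps]
    return weights + [0.0] * (len(scores) - len(visible))


def attention_output(weights: List[float], values: List[List[float]]) -> List[float]:
    """Attention-weighted sum of the value vectors (one vector per position)."""
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values))
        for d in range(dim)
    ]


# Toy example: 4 positions, 2-dimensional value vectors.
scores = [0.0, 0.0, 0.0, 5.0]  # raw (pre-softmax) attention scores
values = [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0], [100.0, 1.0]]
weights = masked_attention_weights(scores, query_pos=2)
output = attention_output(weights, values)
```

Because the last position lies after the query, its large score and value have no effect on the output: the three visible positions share the weight equally, and the result is their average.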
Fig. 3 shows the temporal patterns learned by the TFT – with the top chart recording the mean along with the 10th, 50th and 90th percentiles of the attention weights for one-step-ahead forecasts over the test set, and the bottom chart showing the average attention weights for various horizons. Based on the regularly spaced peaks at 24-hour intervals, we can infer that the TFT has learned a strong daily seasonal pattern for predictions – placing the largest attention on the same hour of preceding days.
7.3. Identifying Regimes & Significant Events
Apart from persistent patterns, identifying sudden changes in temporal patterns can also be very useful, as temporary shifts can occur due to the presence of significant regimes or events. For instance, regime-switching behavior has been widely documented in financial markets (Ang and Timmermann, 2012), with return characteristics – such as volatility – being observed to change abruptly between regimes.
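Firstly, for a given entity, we define the average attention pattern per forecast horizon, using the attention weights $\alpha(t, n, \tau)$ of Eq. (27), to be:
(28) $\bar{\alpha}(n, \tau) = \sum_{t=1}^{T} \alpha(t, n, \tau) / T$
and then construct $\bar{\boldsymbol{\alpha}}(\tau) = [\bar{\alpha}(-k, \tau), \dots, \bar{\alpha}(\tau_{\max}, \tau)]^T$. To compare similarities between attention weight vectors, we use the distance metric proposed by Comaniciu et al. (2003):
(29) $\kappa(\boldsymbol{p}, \boldsymbol{q}) = \sqrt{1 - \rho(\boldsymbol{p}, \boldsymbol{q})}$
where $\rho(\boldsymbol{p}, \boldsymbol{q}) = \sum_{j} \sqrt{p_j q_j}$ is the Bhattacharyya coefficient (Kailath, 1967) measuring the overlap between discrete distributions – with $p_j, q_j$ being elements of probability vectors $\boldsymbol{p}, \boldsymbol{q}$ respectively. For each entity, significant shifts in temporal dynamics are then measured using the distance between attention vectors at each point with the average pattern, aggregated for all horizons as below:
(30) $\text{dist}(t) = \sum_{\tau=1}^{\tau_{\max}} \kappa\big(\bar{\boldsymbol{\alpha}}(\tau), \boldsymbol{\alpha}(t, \tau)\big) / \tau_{\max}$
where $\boldsymbol{\alpha}(t, \tau) = [\alpha(t, -k, \tau), \dots, \alpha(t, \tau_{\max}, \tau)]^T$.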
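The regime-detection procedure above can be sketched as follows. This is a minimal illustration, assuming that each attention vector is a discrete probability distribution, that the distance takes the form of the square root of one minus the Bhattacharyya coefficient (the form used by Comaniciu et al.), and that per-horizon distances are averaged. The function names and the toy attention values are hypothetical.

```python
import math
from typing import List, Sequence


def bhattacharyya_coefficient(p: Sequence[float], q: Sequence[float]) -> float:
    """Overlap between two discrete distributions: sum_j sqrt(p_j * q_j)."""
    return sum(math.sqrt(pj * qj) for pj, qj in zip(p, q))


def attention_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Distance sqrt(1 - rho(p, q)); clamped at 0 to absorb rounding error."""
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))


def average_pattern(patterns: List[List[float]]) -> List[float]:
    """Element-wise mean of attention vectors over time for one horizon."""
    count = len(patterns)
    return [sum(col) / count for col in zip(*patterns)]


def regime_distance(attn: List[List[List[float]]]) -> List[float]:
    """Distance from the average pattern at each time, averaged over horizons.

    `attn[t][h]` is the attention vector at time t for horizon index h.
    """
    num_horizons = len(attn[0])
    means = [
        average_pattern([step[h] for step in attn])
        for h in range(num_horizons)
    ]
    return [
        sum(attention_distance(means[h], step[h]) for h in range(num_horizons))
        / num_horizons
        for step in attn
    ]


# Toy data: three time steps, one horizon, two attention positions.
attn = [
    [[0.5, 0.5]],
    [[0.5, 0.5]],
    [[1.0, 0.0]],  # attention collapses onto one position
]
distances = regime_distance(attn)
```

In the toy data, the last time step departs from the typical pattern and therefore receives the largest distance, which is the kind of spike one would look for when flagging a possible regime change.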
We use the Volatility dataset as a test case for regime identification, specifically applying our distance metric to the attention patterns for the S&P 500 index over our training period from 2001 to 2015. Plotting the resulting distance against the target (i.e. log realized volatility) in the bottom chart of Fig. 4, significant deviations in attention patterns can be observed around periods of high volatility – corresponding to the peaks observed in the distance measure. From the plots, we can see that the TFT appears to alter its behaviour between regimes – placing equal attention across historical inputs when volatility is low, while attending more to sharp trend changes during high volatility periods – suggesting differences in temporal dynamics learned in each.
8. Conclusions
We introduce the Temporal Fusion Transformer (TFT) – a novel attention-based deep neural network model for interpretable high-performance multi-horizon time series forecasting. The TFT utilizes specialized components to handle the full range of inputs typically present in multi-horizon forecasting problems (i.e. static covariates, a priori known inputs, and observed inputs). Specifically, these include: 1) sequence-to-sequence and attention-based temporal processing components that capture time-varying relationships at different timescales, 2) static covariate encoders that allow the network to condition temporal forecasts on static metadata, 3) gating components that enable skipping over any parts of the network that are unnecessary for a given dataset, 4) variable selection networks that select relevant input features at each time step, and 5) quantile predictions to obtain output intervals across all prediction horizons. Through tests on a series of real-world datasets, we show that the TFT achieves state-of-the-art forecasting performance on both simple datasets that contain only known inputs, and complex datasets which encompass the full range of possible inputs. Finally, we investigate the general relationships learned by the TFT through a series of interpretability use cases – proposing novel methods to use the TFT to 1) analyze important variables for a given prediction problem, 2) visualize persistent temporal relationships learned (e.g. seasonality), and 3) identify significant regimes present in the dataset.
Acknowledgements.
The authors gratefully acknowledge discussions with Yaguang Li, Maggie Wang, Jeffrey Gu and Andrew Moore that contributed to the development of this paper.
Appendix
Appendix B Dataset Description
Additional details on each dataset can be found below. We provide sufficient information on feature preprocessing and train/test splits to ensure reproducibility of our results.
Electricity: Per (Flunkert et al., 2017), we use 500k samples taken between 2014-01-01 and 2014-09-01 – using the earlier portion for training, and the final portion as a validation set. Testing is done over the 7 days immediately following the training set – as described in (Flunkert et al., 2017; Yu et al., 2016). Given the large differences in magnitude between trajectories, we also apply z-score normalization separately to each entity for real-valued inputs. In line with previous work, we consider the electricity usage, day-of-week, hour-of-day and a time index – i.e. the number of time steps from the first observation – as real-valued inputs, and treat the entity identifier as a categorical variable.
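For illustration, per-entity z-score normalization can be sketched as below. The function name, the use of the population standard deviation, the guard for constant series and the toy values are assumptions for this example; in practice the statistics would typically be computed on the training portion only.

```python
import statistics
from typing import Dict, List


def zscore_by_entity(series: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Standardize each entity's values with that entity's own mean and std."""
    normalized = {}
    for entity, values in series.items():
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        scale = std if std > 0 else 1.0  # guard against constant series
        normalized[entity] = [(v - mean) / scale for v in values]
    return normalized


# Two hypothetical entities whose usage differs by three orders of magnitude.
usage = {
    "small_site": [1.0, 2.0, 3.0],
    "large_site": [1000.0, 2000.0, 3000.0],
}
normalized = zscore_by_entity(usage)
```

After normalization both entities follow the same standardized trajectory, so a shared network is not dominated by the entity with the largest raw magnitude.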
Traffic: Tests on the Traffic dataset are also kept consistent with previous work, using 500k training samples taken before 2008-06-15 as per (Flunkert et al., 2017), and split in the same way as the Electricity dataset. For testing, we use the 7 days immediately following the training set, and z-score normalization is applied across all entities. For inputs, we also take traffic occupancy, day-of-week, hour-of-day and a time index as real-valued inputs, and the entity identifier as a categorical variable.
Retail: For the Retail dataset, we treat each product number-store number pair as a separate entity, with over 135k entities present within the full dataset. For network calibration, the training set is made up of 450k samples taken between 2015-01-01 and 2015-12-01, the validation set of 50k samples from the 30 days after the training set, and the test set of all entities over the 30-day horizon following the validation set. We use all inputs supplied by the Kaggle competition (full list in the variable importance section of Appendix E.1), along with additional variables for the day-of-week, day-of-month, and month. Data is also resampled at regular daily intervals, imputing any missing days using the last available observation. We also include an additional ’open’ flag to denote whether data is present on a given day. For holidays, we group national, regional, and local holidays as separate categorical variables. We also apply a log-transform on the sales data, and adopt z-score normalization across all entities. We consider log sales, transactions and oil to be real-valued variables – with the remainder treated as categorical inputs.
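The daily resampling step for the Retail data can be sketched as below: missing days take the last available observation, and an 'open' flag records whether the day was actually observed. The function name, the row layout and the toy sales values are hypothetical; the actual preprocessing code may differ.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple


def resample_daily(observations: Dict[date, float]) -> List[Tuple[date, float, int]]:
    """Fill every calendar day between the first and last observation.

    Returns (day, value, open_flag) rows, where open_flag is 1 if the day had
    an observation and 0 if its value was carried forward from the last one.
    """
    if not observations:
        return []
    first, last = min(observations), max(observations)
    rows = []
    last_value = observations[first]
    day = first
    while day <= last:
        if day in observations:
            last_value = observations[day]
            rows.append((day, last_value, 1))
        else:
            rows.append((day, last_value, 0))
        day += timedelta(days=1)
    return rows


# Toy series with two missing days (3 and 4 January).
sales = {date(2015, 1, 1): 10.0, date(2015, 1, 2): 12.0, date(2015, 1, 5): 9.0}
rows = resample_daily(sales)
```

Keeping the flag alongside the imputed value lets a model distinguish a genuinely repeated value from one that was filled in.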
Volatility: Data is downloaded for the period from 2000-01-03 to 2019-06-28 – with the training set consisting of data before 2016, the validation set from 2016–2017, and the test set data from 2018 onwards. For the target, we focus on 5-min sub-sampled realized volatility (i.e. the rv5_ss column), and add daily open-to-close returns as an extra exogenous input. Also, additional variables are included for the day-of-week, day-of-month, week-of-year, and month – along with a ’region’ variable for each index (i.e. Americas, Europe or Asia). Finally, a time index is added to denote the number of days from the first day in our training set. We treat all date-related variables (i.e. day-of-week, day-of-month, week-of-year, and month) and the region input as categorical variables. A log transformation is also applied to the target, and all inputs are z-score normalized across all entities.
Appendix C Training Details
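Table 3. Dataset details and optimal TFT hyperparameters.
Group | Parameter | Electricity | Traffic | Retail | Volatility
Dataset Details | Target Type | $\mathbb{R}$ | $[0, 1]$ | $\mathbb{R}$ | $\mathbb{R}$
Dataset Details | Num. Entities | 370 | 440 | 130k | 41
Dataset Details | Num. Samples | 500k | 500k | 500k | 100k
Optimal Hyperparameters | Random Search Iterations | 60 | 60 | 60 | 240
Optimal Hyperparameters | State Size | 160 | 320 | 240 | 160
Optimal Hyperparameters | Dropout Rate | 0.1 | 0.3 | 0.1 | 0.3
Optimal Hyperparameters | Minibatch Size | 64 | 128 | 128 | 64
Optimal Hyperparameters | Learning Rate | 0.001 | 0.001 | 0.001 | 0.01
Optimal Hyperparameters | Max Gradient Norm | 0.01 | 100 | 100 | 0.01
Optimal Hyperparameters | Num. Heads | 4 | 4 | 4 | 1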
Hyperparameter optimization is conducted via random search, using 240 iterations for the smaller Volatility dataset, and 60 iterations for the larger Electricity, Traffic and Retail datasets. Full search ranges for all hyperparameters are below, and the optimal TFT hyperparameters can be found in Table 3:

State size – 10, 20, 40, 80, 160, 240, 320

Dropout rate – 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9

Minibatch size – 64, 128, 256

Learning rate – 0.0001, 0.001, 0.01

Max. gradient norm – 0.01, 1.0, 100.0

Num. heads – 1, 4
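A minimal sketch of random search over the ranges listed above is given below. The search space mirrors the listed values, but the function names, the uniform sampling scheme, the seed and the toy objective are assumptions; in an actual run, the hypothetical `toy_validation_loss` would be replaced by training a model and returning its validation loss.

```python
import random
from typing import Callable, Dict, List, Tuple

# Search ranges as listed above.
SEARCH_SPACE: Dict[str, List[float]] = {
    "state_size": [10, 20, 40, 80, 160, 240, 320],
    "dropout_rate": [0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9],
    "minibatch_size": [64, 128, 256],
    "learning_rate": [0.0001, 0.001, 0.01],
    "max_gradient_norm": [0.01, 1.0, 100.0],
    "num_heads": [1, 4],
}


def random_search(
    evaluate: Callable[[Dict[str, float]], float],
    iterations: int,
    seed: int = 0,
) -> Tuple[Dict[str, float], float]:
    """Sample configurations uniformly and keep the lowest validation loss."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(iterations):
        config = {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}
        loss = evaluate(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


def toy_validation_loss(config: Dict[str, float]) -> float:
    """Hypothetical stand-in for training a model and scoring it."""
    return abs(config["dropout_rate"] - 0.3) + abs(config["learning_rate"] - 0.001)


best_config, best_loss = random_search(toy_validation_loss, iterations=60)
```

Fixing the seed makes the sampled configurations reproducible, which is convenient when comparing runs across datasets.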
To preserve the explainability of interpretable multi-head attention, we adopt only a single stack (i.e. temporal fusion decoder block) for the TFT itself. For ConvTrans, we adopt the same fixed stack size and number of heads used in the original paper (Li et al., 2019) – setting them to 3 layers and 8 heads respectively. We also used the full attention model of (Li et al., 2019), and treated the kernel size of the CNN local processing layer as a hyperparameter – as optimal kernel sizes were observed to be dataset dependent in (Li et al., 2019).
Appendix D Ablation Analysis
To quantify the benefits of each of our proposed architectural contributions, we perform an extensive ablation analysis – removing each component from the network as below, and quantifying the percentage increase in loss versus the original architecture:
Gating layers: The effects of gated skip connections are tested by replacing each GLU layer (Eq. (5)) with a simple linear layer passed through an ELU activation function.
Static covariate encoders: The importance of specialized static encoders is tested by setting all context vectors to zero, and concatenating all transformed static inputs to all time-dependent past and future inputs.
Variable selection networks: The effects of instance-wise variable selection are tested by replacing the softmax outputs of Eq. (8) with a vector of trainable coefficients, and removing the networks generating the variable selection weights. We retain, however, the variable-wise GRNs (i.e. Eq. (7)), maintaining the same degree of non-linear processing as before.
Self-attention layers: The benefits of the self-attention layer are quantified by replacing the attention matrix used in the interpretable multi-head attention layer (Eq. (14)) with a matrix of trainable parameters. This prevents the TFT from attending to different input features at different forecast times, helping us evaluate the importance of instance-wise attention weights.
Sequence-to-sequence layers for local processing: We evaluate the importance of local processing by removing the sequence-to-sequence layer of Section 4.5.1 – replacing this with the standard positional encoding used in (Vaswani et al., 2017).
Ablated networks are trained for each dataset using the hyperparameters of Table 3, with full results shown in Figure 5. From the charts, the effects on both P50 and P90 losses are found to be similar across all datasets, with all components contributing to performance improvements on the whole. In general, the components responsible for capturing temporal relationships (i.e. local processing and self-attention layers) have the largest impact on performance, with P90 losses increasing both on average and on select datasets when ablated. Static encoders and variable selection have the next largest impact on P90 losses, both on average and for specific datasets. Finally, gating layer ablation also shows an increase in P90 losses on average. This is most evident on the Volatility dataset, demonstrating the utility of component gating for smaller, noisier datasets.
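The ablation metric, i.e. the percentage increase in loss relative to the original architecture, can be computed as in the short sketch below. The function names, dataset labels and loss values are made up for illustration and are not results from the paper.

```python
from typing import Dict


def percent_increase(ablated_loss: float, baseline_loss: float) -> float:
    """Percentage increase in loss of an ablated network over the full model."""
    if baseline_loss <= 0:
        raise ValueError("baseline_loss must be positive")
    return 100.0 * (ablated_loss - baseline_loss) / baseline_loss


def summarize_ablation(
    baseline: Dict[str, float], ablated: Dict[str, float]
) -> Dict[str, float]:
    """Per-dataset percentage increase, plus the mean across datasets."""
    changes = {
        name: percent_increase(ablated[name], baseline[name]) for name in baseline
    }
    changes["average"] = sum(changes.values()) / len(changes)
    return changes


# Made-up P90 losses for two hypothetical datasets.
baseline_p90 = {"dataset_a": 0.10, "dataset_b": 0.20}
ablated_p90 = {"dataset_a": 0.11, "dataset_b": 0.21}
ablation_summary = summarize_ablation(baseline_p90, ablated_p90)
```

Reporting the change relative to each dataset's own baseline keeps the comparison meaningful even when absolute losses differ widely across datasets.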
Appendix E Additional Interpretability Results
On top of the interpretability use cases of Section 7, which highlight our most prominent findings, we also include the remaining results in this section for completeness.
E.1. Variable Importance
Table 4 shows the variable importance scores for the remaining Electricity, Traffic and Volatility datasets. Given that only one static input is present for these datasets, the network allocates full importance to the entity identifier for Electricity and Traffic, as well as to the region input for Volatility. We also observe two general types of important time-dependent inputs – those related to past values of the target as before, and those related to calendar effects. For instance, the hour-of-day plays a significant role for the Electricity and Traffic datasets, echoing the daily seasonality observed in the next section. In the Volatility dataset, the day-of-month is observed to play a significant role in future inputs – potentially reflecting turn-of-month effects (Giovanis, 2014).



E.2. Persistent Temporal Patterns
Fig. 6 shows the attention weight patterns across all datasets, and extends the results of Section 7.2. We observe that three of the datasets exhibit a seasonal pattern, with clear attention spikes at daily intervals observed for the Electricity and Traffic datasets, and a slightly weaker weekly pattern for the Retail dataset. No strong persistent patterns were observed for the Volatility dataset however, with attention weights equally distributed across all positions on average. This resembles a moving average filter at the feature level, and – given the high degree of randomness associated with the volatility process – could be useful in extracting the trend over the entire period by smoothing out high-frequency noise.
Footnotes
 conference: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; August 22–27, 2020; San Diego, CA
 GitHub URL: https://github.com/google-research/google-research/tree/master/tft
References
 A. Alaa and M. van der Schaar. 2019. Attentive State-Space Modeling of Disease Progression. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019).
 Andrew Ang and Allan Timmermann. 2012. Regime Changes and Financial Markets. Annual Review of Financial Economics 4, 1 (2012), 313–337.
 Sercan O. Arik and Tomas Pfister. 2019. TabNet: Attentive Interpretable Tabular Learning. (2019). arXiv:1908.07442
 Badi Baltagi. 2008. Distributed Lags and Dynamic Models. Springer Berlin Heidelberg, 129–145.
 Joos-Hendrik Böse et al. 2017. Probabilistic Demand Forecasting at Scale. Proc. VLDB Endow. 10, 12 (Aug. 2017), 1694–1705.
 Carlos Capistran, Christian Constandse, and Manuel Ramos-Francia. 2010. Multi-horizon inflation forecasts using disaggregated data. Economic Modelling 27, 3 (2010), 666–677.
 Edward Choi et al. 2016. RETAIN: An Interpretable Predictive Model for Healthcare Using Reverse Time Attention Mechanism. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS 2016).
 Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2016. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). In International Conference on Learning Representations (ICLR 2016).
 D. Comaniciu, V. Ramesh, and P. Meer. 2003. Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 5 (2003), 564–577.
 Pascal Courty and Hao Li. 1999. Timing of Seasonal Sales. The Journal of Business 72, 4 (1999), 545–572.
 Yann Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language Modeling with Gated Convolutional Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017).
 Sizhen Du, Guojie Song, Lei Han, and Haikun Hong. 2018. Temporal Causal Inference with Time Lag. Neural Computation 30, 1 (2018), 271–291.
 Chenyou Fan et al. 2019. Multi-Horizon Time Series Forecasting with Temporal Attention Learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19).
 Corporacion Favorita. 2018. Corporacion Favorita Grocery Sales Forecasting Competition. (2018). https://www.kaggle.com/c/favorita-grocery-sales-forecasting/
 Valentin Flunkert et al. 2017. DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks. CoRR abs/1704.04110 (2017). arXiv:1704.04110
 Yarin Gal and Zoubin Ghahramani. 2016. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In Advances in Neural Information Processing Systems 29.
 Eleftherios Giovanis. 2014. The Turn-of-The-Month-Effect: Evidence from Periodic Generalized Autoregressive Conditional Heteroskedasticity (PGARCH) Model. International Journal of Economic Sciences and Applied Research 7 (12 2014), 43–61.
 Tian Guo, Tao Lin, and Nino Antulov-Fantulin. 2019. Exploring interpretable LSTM neural networks over multi-variable data. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019).
 Gerd Heber, Asger Lunde, Neil Shephard, and Kevin K. Sheppard. 2009. Oxford-Man Institute’s Realized Library. (2009). https://realized.oxford-man.ox.ac.uk/
 Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (Nov. 1997), 1735–1780.
 Svend Hylleberg (Ed.). 1992. Modelling Seasonality. Oxford University Press.
 T. Kailath. 1967. The Divergence and Bhattacharyya Distance Measures in Signal Selection. IEEE Transactions on Communication Technology 15, 1 (1967), 52–60.
 Jan Koutník, Klaus Greff, Faustino Gomez, and Jürgen Schmidhuber. 2014. A Clockwork RNN. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014).
 Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. arXiv e-prints, Article arXiv:1607.06450 (Jul 2016). arXiv:1607.06450
 Shiyang Li et al. 2019. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019).
 Bryan Lim, Ahmed Alaa, and Mihaela van der Schaar. 2018. Forecasting Treatment Responses Over Time Using Recurrent Marginal Structural Networks. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018).
 Scott Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30 (NIPS 2017).
 Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. 2020. The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting 36, 1 (2020), 54–74.
 M. Marcellino, J Stock, and M. Watson. 2006. A Comparison of Direct and Iterated Multistep AR Methods for Forecasting Macroeconomic Time Series. Journal of Econometrics 135 (2006), 499–526.
 Daniel Neil et al. 2016. Phased LSTM: Accelerating Recurrent Network Training for Long or Event-based Sequences. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS 2016).
 Syama Sundar Rangapuram et al. 2018. Deep State Space Models for Time Series Forecasting. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018).
 Marco Ribeiro et al. 2016. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’16).
 Huan Song et al. 2018. Attend and Diagnose: Clinical Time Series Analysis Using Attention Models (AAAI 2018).
 Stessman HA, Bernier R, Eichler EE. 2014. A genotype-first approach to defining the subtypes of a complex disease. Cell 156, 5 (2014), 872–877.
 Souhaib Ben Taieb, Antti Sorjamaa, and Gianluca Bontempi. 2010. Multiple-output modeling for multi-step-ahead time series forecasting. Neurocomputing 73, 10 (2010), 1950–1957.
 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30.
 F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. 2017. Residual Attention Network for Image Classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 Yuyang Wang et al. 2019. Deep Factors for Forecasting. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019).
 Ruofeng Wen et al. 2017. A MultiHorizon Quantile Recurrent Forecaster. In NIPS 2017 Time Series Workshop.
 Jinsung Yoon, Sercan O. Arik, and Tomas Pfister. 2019. RL-LIM: Reinforcement Learning-based Locally Interpretable Modeling. (2019). arXiv:cs.LG/1909.12367
 Hsiang-Fu Yu, Nikhil Rao, and Inderjit S Dhillon. 2016. Temporal Regularized Matrix Factorization for High-dimensional Time Series Prediction. In Advances in Neural Information Processing Systems 29.
 J Zhang and K Nawata. 2018. Multi-step prediction for influenza outbreak by an adjusted long short-term memory. Epidemiology and infection 146, 7 (2018).