A Statistical Investigation of Long Memory in Language and Music
Abstract
Representation and learning of long-range dependencies is a central challenge confronted in modern applications of machine learning to sequence data. Yet despite the prominence of this issue, the basic problem of measuring long-range dependence, either in a given data source or as represented in a trained deep model, remains largely limited to heuristic tools. We contribute a statistical framework for investigating long-range dependence in current applications of sequence modeling, drawing on the statistical theory of long memory stochastic processes. By analogy with their linear predecessors in the time series literature, we identify recurrent neural networks (RNNs) as nonlinear processes that simultaneously attempt to learn both a feature representation for and the long-range dependency structure of an input sequence. We derive testable implications concerning the relationship between long memory in real-world data and its learned representation in a deep network architecture, which are explored through a semiparametric framework adapted to the high-dimensional setting. We establish the validity of statistical inference for a simple estimator, which yields a decision rule for long memory in RNNs. Experiments illustrating this statistical framework confirm the presence of long memory in a diverse collection of natural language and music data, but show that a variety of RNN architectures fail to capture this property even after training to benchmark accuracy in a language model.
1 Introduction
Advances in the design and optimization of deep recurrent neural networks (RNNs) have led to significant breakthroughs in the modeling of complex sequence data, including natural language and music. An omnipresent challenge in these sequence modeling tasks is to capture long-range dependencies between observations, and a great variety of model architectures have been developed with this objective explicitly in mind. However, it can be difficult to assess whether and to what extent a given RNN has learned to represent such dependencies, that is, whether it has long memory.
Currently, if a model’s capacity to represent long-range dependence is measured at all, it is typically evaluated heuristically against some task or tasks in which success is taken as an indicator of “memory” in a colloquial sense. Though undoubtedly helpful, such heuristics are rarely defined with respect to an underlying mathematical or statistical property of interest, nor do they necessarily have any correspondence to the data on which the models are subsequently trained. In this paper, we pursue a complementary approach in which long-range dependence is assessed as a quantitative and statistically accessible feature of a given data source. Consequently, the problem of evaluating long memory in RNNs can be reframed as a comparison between a learned representation and an estimated property of the data.
The main contribution is the development and illustration of a methodology for the estimation, visualization, and hypothesis testing of long memory in RNNs, based on an approach that mathematically defines and directly estimates long-range dependence as a property of a multivariate time series. We thus contextualize a core objective of sequence modeling with deep networks against a well-developed but as-yet-unexploited literature on long memory processes.
We offer extensive validation of the proposed approach and explore strategies to overcome problems with hypothesis testing for long memory in the high-dimensional regime. We report experimental results obtained on a wide-ranging collection of real-world music and language data, confirming the (often strong) long-range dependencies that are observed by practitioners. However, we find that this property is not adequately captured by a variety of RNNs trained to benchmark performance on a language dataset. Code corresponding to these experiments, including an illustrative Jupyter notebook, is available for download at https://github.com/alecgt/RNN_long_memory.
Related work.
Though a formal connection to long memory processes has been lacking thus far, machine learning applications to sequence modeling have long been concerned with the capture of long-range dependencies. The development of RNN models has been strongly influenced by the identification of the “vanishing gradient problem” in (Bengio et al., 1994). More complex recurrent architectures, such as long short-term memory (Hochreiter and Schmidhuber, 1997a), gated recurrent units (Cho et al., 2014), and structurally constrained recurrent networks (Mikolov et al., 2015) were designed specifically to alleviate this problem. Alternative approaches have pursued a more formal understanding of RNN computation, for example through kernel methods (Lei et al., 2017), or by means of ablative strategies clarifying the computation of the RNN hidden state (Levy et al., 2018).
Long-range dependence is most commonly evaluated in RNN models by test performance on a synthetic task in which prediction targets are separated from relevant inputs by long, often fixed, intervals. For example, the target may be the parity of a binary sequence (so-called “parity” problems), or it may be the class of a sequence whose most recent terms are replaced with white noise (“2-sequence” or “latch” problems) (Bengio et al., 1994; Bengio and Frasconi, 1994; Lin et al., 1996). A simple demonstration relatively early in RNN history by Hochreiter and Schmidhuber (1997b) showed that such tasks can often be solved quickly by random parameter search, casting doubt on their informativeness. Whereas the authors proposed a different heuristic, we seek to reframe the problem of long memory evaluation so that it is amenable to statistical analysis.
Classical constructions of long memory processes (Mandelbrot and Van Ness, 1968; Granger and Joyeux, 1980; Hosking, 1981) laid the foundation for statistical methods to estimate long memory from time series data. See also (Moulines et al., 2008; Reisen et al., 2017) for recent works in this area. The multivariate estimator of Shimotsu (2007) is the foundation of the methodology we develop here. It is by now well understood that failure to properly account for long memory can severely diminish performance in even basic estimation or prediction tasks. For example, the sample variance is both biased and inefficient as an estimator of the variance of a stationary long memory process (Percival and Guttorp, 1994). Similarly, failure to model long memory has been shown to significantly harm the predictive performance of time series models, particularly in the case of multistep forecasting (Brodsky and Hurvich, 1999).
2 Background
Long memory in stochastic processes.
Long memory has a simple and intuitive definition in terms of the autocovariance sequence of a real, stationary stochastic process $X_t$. The process $X_t$ is said to have long memory if the autocovariance
$$\gamma_X(k) = \mathrm{Cov}(X_t, X_{t+k})$$
satisfies
$$\gamma_X(k) \sim L_\gamma(k)\,|k|^{-(1-2d)} \quad \text{as } k \to \infty \tag{1}$$
for some $d \in (0, 1/2)$, where $a(k) \sim b(k)$ indicates that $a(k)/b(k) \to 1$ as $k \to \infty$ and $L_\gamma$ is a slowly varying function at infinity. The term “long memory” is justified by the slow (hyperbolic) decay of the autocovariance sequence, which enables meaningful information to be preserved between distant observations in the series. As a consequence of this slow decay, the partial sums of the absolute autocovariance sequence diverge. This can be directly contrasted with the “short memory” case, in which the autocovariance sequence is absolutely summable. Moreover, we note that the parameter $d$ allows one to quantify the memory by controlling the strength of long-range dependencies.
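In the time series literature, a spectral definition of “memory” is preferred, as it unifies the long and short memory cases. A second-order stationary time series can be represented in the frequency domain by its spectral density function
$$f_X(\lambda) = \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} \gamma_X(k)\, e^{-ik\lambda}.$$
If $X_t$ has a spectral density function that satisfies
$$f_X(\lambda) \sim L_f(\lambda)\,|\lambda|^{-2d} \quad \text{as } \lambda \to 0, \tag{2}$$
then $X_t$ has long memory if $d \in (0, 1/2)$, short memory for $d = 0$, and “intermediate memory” or “antipersistence” if $d \in (-1/2, 0)$. The two definitions of long memory are equivalent when the slowly varying function is quasimonotone (Beran et al., 2013).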
We summarize the complementary time and frequency domain views of long memory with a simple illustration in Figure 1, which contrasts a short memory autoregressive (AR) process of order 1 with its long memory counterpart, the fractionally integrated AR process. The partial sums of the autocovariance series are seen to converge rapidly for the AR process, whereas they diverge for the fractionally integrated AR process. Meanwhile, Eq. (2) implies that the long memory parameter $d$ has a geometric interpretation in the frequency domain as the slope of $\log f_X(\lambda)$ versus $-2\log\lambda$ as $\lambda \to 0$ (i.e. at low frequencies).
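The time-domain contrast described above can be reproduced numerically. The following is a minimal sketch, not the code behind Figure 1: it assumes the simplest pair of processes, an AR(1) process and fractionally integrated white noise, and uses their closed-form autocorrelations. The function names and the parameter values (0.5 for the AR coefficient, 0.3 for the memory parameter) are illustrative choices.

```python
import numpy as np


def fi_autocorr(d, max_lag):
    """Autocorrelation of fractionally integrated white noise.

    Uses the recursion rho(k) = rho(k-1) * (k - 1 + d) / (k - d), rho(0) = 1.
    """
    k = np.arange(1, max_lag + 1)
    return np.concatenate(([1.0], np.cumprod((k - 1 + d) / (k - d))))


def ar1_autocorr(phi, max_lag):
    """Autocorrelation of a stationary AR(1) process, rho(k) = phi**k."""
    return phi ** np.arange(max_lag + 1)


max_lag = 1000
# Partial sums of the absolute autocorrelation sequence.
fi_sums = np.cumsum(np.abs(fi_autocorr(0.3, max_lag)))
ar_sums = np.cumsum(np.abs(ar1_autocorr(0.5, max_lag)))

# The AR(1) sums level off near 1 / (1 - phi); the long memory sums keep growing.
print(ar_sums[100], ar_sums[1000])
print(fi_sums[100], fi_sums[1000])
```

Plotting `fi_sums` and `ar_sums` against the lag gives the qualitative picture of a diverging versus a rapidly converging partial sum.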
Many common models do not have long memory.
Despite the appeal and practicality of long memory for modeling complex time series, we emphasize that it is absent from nearly all common statistical models for sequence data. We offer a short list of examples, with proofs deferred to Appendix A of the Supplement.

Markov models. If $X_t$ is a Markov process on a finite state space $\mathcal{X}$, and $Y_t = g(X_t)$ for any function $g : \mathcal{X} \to \mathbb{R}$, then $Y_t$ has short memory. We show that this property holds even in a complex model with Markov structure, the Markov transition distribution model for high-order Markov chains (Raftery, 1985). In light of this, long memory processes are sometimes called “non-Markovian”.

Autoregressive moving average (ARMA) models. ARMA models, a ubiquitous tool in time series modeling, likewise have exponentially decaying autocovariances and thus short memory (Brockwell and Davis, 2013). This may be somewhat surprising, as causal ARMA models with nontrivial moving average components are equivalent to linear autoregressive models of infinite order.

Nonlinear autoregressions. Finally, and most importantly for our present focus, nonlinearity of the state transition function is no guarantee of long memory. We show that a class of autoregressive processes in which the state is subject to iterated nonlinear transformations still fails to achieve a slowly decaying autocovariance sequence (Lin et al., 1996).
Semiparametric estimation of long memory.
Methods for the estimation of the long memory parameter have been developed and analyzed under increasingly broad conditions. Here, we focus on semiparametric methods, which offer consistent estimation of the long memory without the need to estimate or even specify a full parametric model. The term “semiparametric” refers to the fact that the estimation protocol involves both the infinite-dimensional periodogram and a finite-dimensional long memory parameter.
Semiparametric estimation in the Fourier domain leverages the implication of Eq. (2) that
$$f_X(\lambda) \sim c_f\,|\lambda|^{-2d} \tag{3}$$
as $\lambda \to 0$, with $c_f$ a nonzero constant. Estimators are constructed directly from the periodogram using only terms corresponding to frequencies near the origin. The long memory parameter is estimated either by log-periodogram regression, which yields the Geweke-Porter-Hudak (GPH) estimator (Geweke and Porter-Hudak, 1983), or through a local Gaussian approximation, which gives the Gaussian semiparametric estimator (GSE) (Robinson et al., 1995). The GSE offers greater efficiency, requires weaker distributional assumptions, and can be defined for both univariate and multivariate time series; therefore it will be our main focus.
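To make the log-periodogram idea concrete, here is a minimal univariate sketch of a GPH-style regression. It is an illustration under stated assumptions rather than the estimator used in the experiments: the function names are hypothetical, the regressor is the usual $-\log(4\sin^2(\lambda_j/2))$, and the bandwidth `m` (the number of low frequencies used) is left to the caller.

```python
import numpy as np


def periodogram(x):
    """Return Fourier frequencies and the periodogram of a univariate series."""
    x = np.asarray(x, dtype=float)
    T = x.size
    dft = np.fft.fft(x - x.mean())
    freqs = 2.0 * np.pi * np.arange(T) / T
    return freqs, np.abs(dft) ** 2 / (2.0 * np.pi * T)


def gph_estimate(x, m):
    """Estimate d by regressing log I(lambda_j) on -log(4 sin^2(lambda_j / 2)).

    Only the first m nonzero Fourier frequencies are used, so the fit reflects
    the behavior of the spectrum near the origin.
    """
    freqs, I = periodogram(x)
    lam, I_m = freqs[1 : m + 1], I[1 : m + 1]
    regressor = -np.log(4.0 * np.sin(lam / 2.0) ** 2)
    slope, _ = np.polyfit(regressor, np.log(I_m), 1)
    return float(slope)


rng = np.random.default_rng(0)
white_noise = rng.standard_normal(4096)
# White noise has a flat spectrum, so the estimate should be close to zero.
print(gph_estimate(white_noise, m=64))
```

The slope of the fitted line is the estimate of the memory parameter, which matches the geometric reading of Eq. (2) as a slope in log-log coordinates.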
Multivariate long memory processes.
Analysis of long memory in multivariate stochastic processes is a topic of more recent investigation in the time series literature. The common underlying assumption in multivariate semiparametric estimation of long memory is that the real, vector-valued process $X_t \in \mathbb{R}^p$ can be written as
$$(1 - B)^{d_i} X_t^{(i)} = U_t^{(i)}, \quad i = 1, \dots, p, \tag{4}$$
where $X_t^{(i)}$ is the $i$th component of $X_t$, $U_t$ is a second-order stationary process with spectral density function bounded and bounded away from zero at zero frequency, $B$ is the backshift in time operator, and $|d_i| < 1/2$ for every $i$ (Shimotsu, 2007). The backshift operation $B^k X_t = X_{t-k}$ is extended to non-integer orders via
$$(1 - B)^{d} = \sum_{k=0}^{\infty} \frac{\Gamma(k - d)}{\Gamma(-d)\,\Gamma(k + 1)}\, B^k,$$
and thus $X_t$ is referred to as a fractionally integrated process when $d \neq 0$. Fractionally integrated processes are the most commonly used models for data with long-range dependencies, encompassing parametric classes such as the vector autoregressive fractionally integrated moving average (VARFIMA), a multivariate and long memory extension of the popular ARMA family of time series models.
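The series expansion of the fractional differencing operator translates directly into a filter. The sketch below assumes a truncation of the infinite sum at the length of the observed series and computes the coefficients by the standard recursion $\pi_k = \pi_{k-1}(k - 1 - d)/k$; the function names are illustrative.

```python
import numpy as np


def frac_diff_weights(d, n_terms):
    """First n_terms coefficients of the expansion of (1 - B)^d."""
    w = np.empty(n_terms)
    w[0] = 1.0
    for k in range(1, n_terms):
        w[k] = w[k - 1] * (k - 1 - d) / k
    return w


def frac_filter(x, d):
    """Apply the truncated operator (1 - B)^d to x.

    A positive d differences the series; a negative d fractionally integrates it.
    """
    x = np.asarray(x, dtype=float)
    w = frac_diff_weights(d, x.size)
    return np.convolve(x, w)[: x.size]


rng = np.random.default_rng(0)
u = rng.standard_normal(1024)
x = frac_filter(u, -0.3)        # fractionally integrated noise with d = 0.3
u_back = frac_filter(x, 0.3)    # differencing recovers the white noise input
print(np.max(np.abs(u - u_back)))
```

With `d = 1` the weights reduce to `[1, -1, 0, ...]`, i.e. the ordinary first difference, which is a quick way to check the recursion.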
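If $X_t$ is defined as in Eq. (4), then its spectral density function satisfies (Hannan, 2009)
$$f_X(\lambda) = \Phi(\lambda, d)\, f_U(\lambda)\, \Phi^*(\lambda, d),$$
where $\Phi^*$ denotes the conjugate transpose of $\Phi$, $f_U(\lambda)$ is the spectral density function of $U_t$ at frequency $\lambda$, and
$$\Phi(\lambda, d) = \mathrm{diag}\left( (1 - e^{i\lambda})^{-d_1}, \dots, (1 - e^{i\lambda})^{-d_p} \right).$$
Given an observed sequence $x_1, \dots, x_T$ with discrete Fourier transform
$$y_j = \frac{1}{\sqrt{2\pi T}} \sum_{t=1}^{T} x_t\, e^{i \lambda_j t}, \qquad \lambda_j = \frac{2\pi j}{T},$$
the spectral density matrix is estimated at Fourier frequency $\lambda_j$ by the periodogram
$$I(\lambda_j) = y_j\, y_j^*.$$
Under the assumption that $f_U(\lambda) \sim G$ as $\lambda \to 0$ for some real, symmetric, positive definite $G \in \mathbb{R}^{p \times p}$, the local behavior of $f_X(\lambda)$ around the origin is governed only by $d$ and $G$:
$$f_X(\lambda) \sim \Phi(\lambda, d)\, G\, \Phi^*(\lambda, d) \quad \text{as } \lambda \to 0. \tag{5}$$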
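The periodogram matrices at the lowest Fourier frequencies can be computed with a single FFT. This is a sketch under an assumed sign convention: the transform is taken with $e^{+i\lambda_j t}$, following Shimotsu (2007), which is the complex conjugate of what `numpy.fft.fft` returns for real input. The function name and the `(T, p)` array layout are illustrative choices.

```python
import numpy as np


def periodogram_matrices(x, m):
    """Periodogram matrices I(lambda_j), j = 1..m, for a (T, p) series.

    Returns the frequencies as an (m,) array and the matrices as an
    (m, p, p) complex array, one Hermitian rank-one matrix per frequency.
    """
    x = np.asarray(x, dtype=float)
    T = x.shape[0]
    lam = 2.0 * np.pi * np.arange(1, m + 1) / T
    # conj(fft) switches to the e^{+i t lambda} convention assumed here.
    w = np.conj(np.fft.fft(x, axis=0))[1 : m + 1] / np.sqrt(2.0 * np.pi * T)
    return lam, w[:, :, None] * np.conj(w[:, None, :])


rng = np.random.default_rng(0)
lam, I = periodogram_matrices(rng.standard_normal((256, 3)), m=16)
print(lam.shape, I.shape)
```

The diagonal entries are the univariate periodograms of each coordinate; the off-diagonal entries carry the cross-spectral phase information that the multivariate estimator uses.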
The Gaussian semiparametric estimator
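The Gaussian semiparametric estimator of $d$ (Shimotsu, 2007) is computed from a local, frequency-domain approximation to the Gaussian likelihood based on Eq. (5). The approximation is valid under restriction of the likelihood to a range of frequencies close to the origin. Using the identity $1 - e^{i\lambda} = 2\sin(\lambda/2)\, e^{i(\lambda - \pi)/2}$, we have the approximation
$$\Phi(\lambda, d) \approx \Lambda(\lambda, d) = \mathrm{diag}\left( \lambda^{-d_1} e^{i(\pi - \lambda) d_1 / 2}, \dots, \lambda^{-d_p} e^{i(\pi - \lambda) d_p / 2} \right),$$
which is valid up to an error term of order $O(\lambda^2)$.
The Gaussian log-likelihood is written in the frequency domain as (Whittle, 1953)
$$\mathcal{L}_m(G, d) = \frac{1}{m} \sum_{j=1}^{m} \left[ \log\det\left( \Lambda_j(d)\, G\, \Lambda_j^*(d) \right) + \mathrm{Tr}\left( \left( \Lambda_j(d)\, G\, \Lambda_j^*(d) \right)^{-1} I(\lambda_j) \right) \right], \qquad \Lambda_j(d) = \Lambda(\lambda_j, d).$$
Validity of the approximation is ensured by restriction of the sum to the first $m$ Fourier frequencies, with $m = o(T)$.
Solving the first-order optimality condition
$$\frac{\partial \mathcal{L}_m(G, d)}{\partial G} = 0$$
for $G$ yields
$$\hat{G}(d) = \frac{1}{m} \sum_{j=1}^{m} \mathrm{Re}\left[ \Lambda_j(d)^{-1}\, I(\lambda_j)\, \Lambda_j^*(d)^{-1} \right].$$
Substitution back into the objective results in the expression
$$R(d) = \log\det \hat{G}(d) - 2 \sum_{i=1}^{p} d_i\, \frac{1}{m} \sum_{j=1}^{m} \log \lambda_j, \tag{6}$$
and the Gaussian semiparametric estimator is obtained as the minimizer
$$\hat{d}_{GSE} = \arg\min_{d \in \Theta} R(d) \tag{7}$$
over the feasible set $\Theta = (-1/2, 1/2)^p$.
A key result due to Shimotsu (2007) establishes that the estimator is consistent and asymptotically normal under mild conditions, with
$$\sqrt{m}\,\left( \hat{d}_{GSE} - d_0 \right) \to_d N\left( 0, \Omega^{-1} \right), \tag{8}$$
where
$$\Omega = 2\left[ G \circ G^{-1} + I_p + \frac{\pi^2}{4} \left( G \circ G^{-1} - I_p \right) \right],$$
$d_0$ is the true long memory, and $\circ$ denotes the Hadamard product.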
Optimization.
Relatively little discussion of optimization procedures for the problem in Eq. (7) is available in the time series literature. We are not aware of any proof that the objective is convex in the multivariate setting, for instance.
To compute the estimator, we apply L-BFGS-B, a quasi-Newton algorithm that handles box constraints (Byrd et al., 1995). L-BFGS-B is an iterative algorithm requiring the gradient of the objective; this is derived in Appendix B of the Supplement.
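The optimization step can be sketched with SciPy's implementation of L-BFGS-B. This is an illustrative reconstruction, not the released code: it assumes the concentrated objective of Shimotsu (2007), $R(d) = \log\det\hat{G}(d) - 2\sum_i d_i \cdot \frac{1}{m}\sum_j \log\lambda_j$, and it lets SciPy approximate the gradient by finite differences instead of using the analytic gradient from the Supplement. The function names, the starting point at zero, and the box bound of 0.49 are assumptions of the sketch.

```python
import numpy as np
from scipy.optimize import minimize


def periodogram_matrices(x, m):
    """Periodogram matrices at the first m Fourier frequencies of a (T, p) series."""
    x = np.asarray(x, dtype=float)
    T = x.shape[0]
    lam = 2.0 * np.pi * np.arange(1, m + 1) / T
    w = np.conj(np.fft.fft(x, axis=0))[1 : m + 1] / np.sqrt(2.0 * np.pi * T)
    return lam, w[:, :, None] * np.conj(w[:, None, :])


def gse_objective(d, lam, I):
    """Concentrated local Whittle objective R(d) (assumed form, see lead-in)."""
    d = np.asarray(d, dtype=float)
    theta = (np.pi - lam) / 2.0
    # Diagonal of Lambda_j(d)^{-1}: lambda_j^{d_a} * exp(-i * theta_j * d_a).
    inv_diag = lam[:, None] ** d[None, :] * np.exp(-1j * theta[:, None] * d[None, :])
    # Lambda_j^{-1} I_j (Lambda_j^*)^{-1}: scale rows, then columns by the conjugate.
    scaled = inv_diag[:, :, None] * I * np.conj(inv_diag)[:, None, :]
    G_hat = scaled.real.mean(axis=0)
    _, logdet = np.linalg.slogdet(G_hat)
    return logdet - 2.0 * d.sum() * np.log(lam).mean()


def gse_estimate(x, m, bound=0.49):
    """Minimize R(d) over the box [-bound, bound]^p with L-BFGS-B."""
    x = np.asarray(x, dtype=float)
    lam, I = periodogram_matrices(x, m)
    p = x.shape[1]
    result = minimize(
        gse_objective,
        x0=np.zeros(p),
        args=(lam, I),
        method="L-BFGS-B",
        bounds=[(-bound, bound)] * p,
    )
    return result.x


rng = np.random.default_rng(0)
x = rng.standard_normal((4096, 2))   # white noise, so the estimates should be near zero
print(gse_estimate(x, m=64))
```

Because convexity of the objective is not established, it may be worth restarting from several initial points and keeping the smallest objective value. Note also that $\hat{G}(d)$ is only invertible when the bandwidth `m` is at least the dimension `p`.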
Bandwidth selection
The choice of the bandwidth parameter $m$ determines the trade-off between bias and variance in the estimator: at small $m$ the variance may be high due to few data points, while setting $m$ too large can introduce bias by accounting for the behavior of the spectral density function away from the origin.
When it is possible to simulate from the target process, as will be the case when we evaluate criteria for long memory in recurrent neural networks, we can naturally control the variance simply by simulating long sequences and computing a dense estimate of the periodogram. Without knowledge of the shape of the spectral density function, however, it is difficult to know how to set the bandwidth to avoid bias, and thus we prefer the relatively conservative setting of . This choice is justified by a bias study for the bandwidth parameter, which is given in Appendix C of the Supplement.
3 Methods
RNN hidden state as a nonlinear model for a long memory process.
The standard tool for statistical modeling of long memory processes is the autoregressive fractionally integrated moving average (ARFIMA) model, which represents the process $X_t$ with long memory parameter $d$ as
$$\phi(B)\,(1 - B)^{d} X_t = \theta(B)\, Z_t,$$
where $Z_t$ is a white noise process (Brockwell and Davis, 2013). The polynomials $\phi$ and $\theta$ control the autoregressive and moving average components of the model, respectively, while the fractional differencing $(1 - B)^d$ parameterizes the difference in long-range dependence between the long memory process $X_t$ and the short memory white noise $Z_t$.
We extend this view to deep network models for sequences with long-range dependencies. The key difference is that RNN models are not constrained to work with a linear representation of the data, nor do they explicitly contain a step that filters out the long memory in $X_t$. We thus characterize RNN modeling for a long memory process via
(9)  
(10) 
First, the RNN generates a nonlinear representation of . The parametric structure of depends on the specific choice of recurrent architecture and in general can be highly complex, with modern implementations routinely involving multiple recurrent layers and millions of parameters. The hidden state is then typically mapped to the output quantity of interest by a simple linear transformation, which corresponds to the autoregressive form of Eq. (10). This view of deep recurrent models thus aligns with a broader theoretical characterization of deep learning as approximate linearization of complex decision boundaries in input space by means of a learned nonlinear feature representation (Mallat, 2016; Mairal et al., 2014; Jones et al., 2019; Bietti and Mairal, 2019).
Testable criteria for RNN capture of longrange dependence.
If is a process whose long memory parameter is either known or can be estimated accurately, then Eqs. (9)(10) have direct and testable implications for evaluating the memory of an RNN trained on observations from . The linearity of the output step shows that the hidden representation must have long memory identical to the white noise process . Therefore, if an RNN is to adequately capture the longrange dependence in , the nonlinear feature representation must also act as a filter that transforms the long memory to the short memory .
We identify two complementary criteria for evaluating this in practice:

Filtering of simulated long memory. Define
where is a standard Gaussian white noise and is the long memory parameter corresponding to the source on which the model was trained. If the sequence is drawn from , then we expect to find that
where is the RNN hidden representation of the simulated input. On the other hand, residual long memory in the hidden state indicates failure to adequately model this property of the original source .

Long memory transformation of white noise. Conversely, we expect to find that the RNN hidden representation of a white noise sequence has a nonzero long memory parameter. White noise has a constant spectrum and thus a long memory parameter equal to zero. If implements (or approximates) a fractional differencing operation as required for Eqs. (9)(10) to be a valid model of a long memory process, then a zeromemory input will be transformed to a nonzeromemory sequence of hidden states.
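The two criteria above amount to feeding two kinds of simulated input through the recurrent map and collecting the hidden states for memory estimation. The sketch below shows only that data-generation step, under loud assumptions: a toy, untrained tanh recurrence with random weights stands in for a trained RNN, the memory parameter 0.3 is arbitrary, and all names are hypothetical.

```python
import numpy as np


def fractional_noise(T, d, rng):
    """Truncated (1 - B)^{-d} applied to standard Gaussian white noise."""
    k = np.arange(1, T)
    psi = np.concatenate(([1.0], np.cumprod((k - 1 + d) / k)))
    return np.convolve(rng.standard_normal(T), psi)[:T]


def hidden_states(inputs, W, U):
    """Run h_t = tanh(W h_{t-1} + U x_t) and return all states as a (T, q) array."""
    h = np.zeros(W.shape[0])
    states = np.empty((inputs.shape[0], W.shape[0]))
    for t, x_t in enumerate(inputs):
        h = np.tanh(W @ h + U @ x_t)
        states[t] = h
    return states


rng = np.random.default_rng(0)
T, p, q = 2048, 3, 5                      # sequence length, input and hidden sizes
W = rng.standard_normal((q, q)) / np.sqrt(q)
U = rng.standard_normal((q, p)) / np.sqrt(p)

# Criterion 1: input with long memory in every coordinate.
long_memory_input = np.column_stack([fractional_noise(T, 0.3, rng) for _ in range(p)])
# Criterion 2: white noise input with zero memory.
white_noise_input = rng.standard_normal((T, p))

h_long = hidden_states(long_memory_input, W, U)
h_white = hidden_states(white_noise_input, W, U)
print(h_long.shape, h_white.shape)
```

A long memory estimator would then be applied to `h_long` (looking for residual memory) and to `h_white` (looking for memory that the network has introduced).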
Total memory.
It is common for sequence embeddings and RNN hidden layers to have hundreds of dimensions, and thus long memory estimation for these sequences naturally occurs in a high-dimensional setting. This topic is virtually unexplored in the time series literature, where multivariate studies tend to have modest dimension. Practically, this raises two main issues. First, if the dimension $p$ is not small relative to the bandwidth $m$, then the approximation of the test statistic distribution by its asymptotic limit will be of poor quality, and the resulting test is likely to be miscalibrated. Second, it becomes difficult to interpret the long memory vector, particularly when the coordinates of the corresponding time series are not meaningful themselves.
We resolve both issues by considering the total memory statistic $\bar{d}$, defined as
$$\bar{d} = \mathbf{1}^\top d = \sum_{i=1}^{p} d_i, \tag{11}$$
where $d = (d_1, \dots, d_p)$ is the vector of long memory parameters. Computation of the total memory is no more complex than that of the GSE, and it has an intuitive interpretation as the coordinate-wise aggregate strength of long memory in a multivariate time series.
Asymptotic normality of the total memory estimator.
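The total memory is a simple linear functional of the GSE, and thus its consistency and asymptotic normality can be established by a simple argument. In particular, defining
$$g(d) = \mathbf{1}^\top d,$$
we see that $\nabla g(d) = \mathbf{1}$, which is clearly nonzero, so that by Eq. (8) and the delta method we have
$$\sqrt{m}\,\left( \hat{\bar{d}} - \bar{d}_0 \right) \to_d N\left( 0, \mathbf{1}^\top \Omega^{-1} \mathbf{1} \right), \tag{12}$$
where $\bar{d}_0$ is the true total memory of the observed process.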
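The asymptotic normality of the total memory gives a one-sided test that needs only a vector of memory estimates, an estimate of $G$, and the bandwidth. The sketch below is an illustration under stated assumptions: it uses the form of $\Omega$ from Shimotsu (2007), $\Omega = 2[G \circ G^{-1} + I + \frac{\pi^2}{4}(G \circ G^{-1} - I)]$, plugs an estimate of $G$ into it, and takes the variance of the total memory estimate to be $\mathbf{1}^\top \Omega^{-1} \mathbf{1} / m$. Function names are hypothetical.

```python
import math

import numpy as np


def omega_matrix(G):
    """Omega = 2 [G o G^{-1} + I + (pi^2 / 4) (G o G^{-1} - I)], o = Hadamard product."""
    G = np.asarray(G, dtype=float)
    identity = np.eye(G.shape[0])
    hadamard = G * np.linalg.inv(G)
    return 2.0 * (hadamard + identity + (math.pi ** 2 / 4.0) * (hadamard - identity))


def total_memory_test(d_hat, G_hat, m):
    """Return the total memory, its standard error, and a one-sided p-value.

    The p-value is the upper-tail normal probability for H0: total memory = 0
    against the alternative of positive total memory.
    """
    d_hat = np.asarray(d_hat, dtype=float)
    ones = np.ones(d_hat.size)
    variance = ones @ np.linalg.solve(omega_matrix(G_hat), ones) / m
    total = float(d_hat.sum())
    std_err = math.sqrt(variance)
    p_value = 0.5 * math.erfc(total / std_err / math.sqrt(2.0))
    return total, std_err, p_value


# Example with uncorrelated coordinates (G = I), where Omega = 4 I.
print(total_memory_test([0.1, 0.1, 0.1, 0.1], np.eye(4), m=100))
```

Dividing the total by the dimension gives a per-coordinate average that is comparable across series of different dimension.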
To validate this proposed estimator, we provide a “sanity check” on simulated high-dimensional data with known long memory in Appendix D of the Supplement.
Visualizing and testing for long memory in high dimensions.
The visual timedomain summary of long memory in Figure 1 can be extended to the multivariate setting. In this case, the autocovariance is matrixvalued, which for the purpose of evaluating long memory can be summarized by the scalar , where the absolute value is taken elementwise. Recall that a sufficient condition for short memory is the absolute convergence of the autocovariance series, whereas this series diverges for long memory processes.
From a testing perspective, a statistical decision rule for the presence of long memory can be derived from the asymptotic distribution of the corresponding estimator. However, when the dimension is large and we conservatively set the bandwidth , we may have even when the observed sequence is relatively long.
The classical approach to testing for the multivariate Gaussian mean is based on the Wald statistic
$$W = m\, \hat{d}^\top \Omega\, \hat{d},$$
which has a $\chi^2_p$ distribution under the null hypothesis $H_0 : d_0 = 0$.
In Appendix E of the Supplement, we give a demonstration that the standard Wald test can be seriously miscalibrated when the dimension is large relative to the bandwidth, whereas testing for long memory with the total memory statistic remains well-calibrated in this setting. These results are consistent with previous observations that the Wald test for long memory can have poor finite-sample performance even in low dimensions (Shimotsu, 2007; Hurvich and Chen, 2000), though these studies suggest no alternative.
4 Experiments
4.1 Long memory in language and music
Much of the development of deep recurrent neural networks has been motivated by the goal of finding good representations and models for text and audio data. Our results in this section confirm that such data can be considered as realizations of long memory processes. (Code for all results in this section is available at https://github.com/alecgt/RNN_long_memory.) A full summary of results is given in Table 1, and autocovariance partial sums are plotted in Figure 2. To facilitate comparison of the estimated long memory across time series of different dimension, we report the normalized total memory $\bar{d}/p$ in all tables.
For all experiments, we test the null hypothesis
$$H_0 : \bar{d}_0 = 0$$
against the one-sided alternative of long memory,
$$H_1 : \bar{d}_0 > 0.$$
We fix the level $\alpha$ of the test and compute the corresponding critical value from the asymptotic distribution of the total memory estimator. Given an estimate of the total memory, a p-value is computed as the upper-tail probability of that estimate under the asymptotic null distribution; note that a p-value less than $\alpha$ corresponds to rejection of the null hypothesis in favor of the long memory alternative.
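Table 1: Total memory in natural language and music data.

Category | Data | Norm. total memory | p-value | Reject
--- | --- | --- | --- | ---
Natural language | Penn TreeBank | 0.163 | 0.0 | ✓
Natural language | Facebook CBT | 0.0636 | 0.0 | ✓
Natural language | King James Bible | 0.192 | 0.0 | ✓
Music | J.S. Bach | 0.0997 | 0.0 | ✓
Music | Miles Davis | 0.322 | 0.0 | ✓
Music | Oum Kalthoum | 0.343 | 0.0 | ✓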
Natural language data.
We evaluate long memory in three different sources of English language text data: the Penn TreeBank training corpus (Marcus et al., 1993), the training set of the Children’s Book Test from Facebook’s bAbI tasks (Weston et al., 2016), and the King James Bible. The Penn TreeBank corpus and King James Bible are considered as single sequences, while the Children’s Book Test data consists of 98 books, which are considered as separate sequences. We require that each sequence be of length at least , which ensures that the periodogram can be estimated with reasonable density near the origin. Finally, we use GloVe embeddings (Pennington et al., 2014) to convert each sequence of word tokens to an equallength sequence of real vectors of dimension .
Our results show significant long memory in each of the text sources, despite their apparent differences. As might be expected, the Children’s Book Test data from Facebook’s bAbI tasks demonstrate the weakest long-range dependencies, as is evident both from the value of the total memory statistic and the slope of the autocovariance partial sum.
Music data.
Modeling and generation of music has recently gained significant visibility in the deep learning community as a challenging set of tasks involving sequence data. As in the natural language experiments, we seek to evaluate long memory in a broad selection of representative data. To this end, we select a complete Bach cello suite consisting of 6 pieces from the MusicNet dataset (Thickstun et al., 2017), the jazz recordings from Miles Davis’ Kind of Blue, and a collection of the most popular works of famous Egyptian singer Oum Kalthoum.
For the Bach cello suite, we embed the data from its raw scalar wav file format using a reduced version of a deep convolutional model that has recently achieved near state-of-the-art prediction accuracy on the MusicNet collection of classical music (Thickstun et al., 2018). Details of the model training, including performance benchmarks, are provided in Appendix F of the Supplement.
We are not aware of a prominent deep learning model for either jazz music or vocal performances. Therefore, for the recordings of Miles Davis and Oum Kalthoum, we revert to a standard method and extract melfrequency cepstral coefficients from the raw wav files at a sample rate of Hz (Logan et al., 2000).
Our results show that long memory appears to be even more strongly represented in music than in text. We find evidence of particularly strong long-range dependence in the recordings of Miles Davis and Oum Kalthoum, consistent with their reputation for repetition and self-reference in their music.
Overall, while the results of this section are unlikely to surprise practitioners familiar with the modeling of language and music data, they are scientifically useful for two main reasons: first, they show that our long memory analysis is able to identify well-known instances of long-range dependence in real-world data; second, they establish quantitative criteria for the successful representation of this dependency structure by RNNs trained on such data.
4.2 Long memory analysis of language model RNNs
We now turn to the question of whether RNNs trained on one of the datasets evaluated above are able to represent the long-range dependencies that we know to be present. We evaluate the criteria for long memory on three different RNN architectures: long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997a), memory cells (Levy et al., 2018), and structurally constrained recurrent networks (SCRN) (Mikolov et al., 2015). Each network is trained on the Penn TreeBank corpus as part of a language model that includes a learned word embedding and linear decoder of the hidden states; the architecture is identical to the “small” LSTM model in (Zaremba et al., 2014), which is preferred for the tractable dimension of the hidden state. Note that our objective is not to achieve state-of-the-art results, but rather to reproduce benchmark performance in a well-known deep learning task. Finally, for comparison, we will also include an untrained LSTM in our experiments; the parameters of this model are simply set by random initialization, which again follows the convention of Zaremba et al. (2014).
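Table 2: Test perplexity of the language models trained on the Penn TreeBank corpus.

Model | Test perplexity
--- | ---
Zaremba et al. | 114.5
LSTM | 114.5
Memory cell | 119.0
SCRN | 124.3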
RNN filtering of long memory input.
The first criterion for long memory asks whether the nonlinear mapping from input to RNN hidden representation is capable of transforming longrange dependent input to short memory features suitable for linear modeling. Having estimated the long memory parameter corresponding to the Penn TreeBank training data in the previous section, we simulate inputs with from a fractionally differenced Gaussian process with the same longrange dependence structure and evaluate the total memory of the corresponding hidden representation for each RNN. Results from trials are compiled in Table 3 (parentheses indicate standard error of total memory estimates). As in the previous section, we test against the null hypothesis
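Table 3: Total memory of RNN hidden representations of simulated long memory input (standard errors in parentheses).

Model | Norm. total memory | p-value | Reject
--- | --- | --- | ---
LSTM (trained) | (0.00963) |  | ✓
LSTM (untrained) | (0.00183) | 0.0 | ✓
Memory cell | (0.0105) |  | ✓
SCRN | 0.0810 (0.0107) |  | ✓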
RNN transformation of white noise.
For a complementary analysis, we evaluate whether the RNNs have learned a transformation that imparts a nontrivial longrange dependency structure to white noise inputs. In this case, the input sequence is drawn from a standard Gaussian white noise process, and we test the corresponding hidden representation for nonzero total memory. As in the previous experiment, we select , choose the bandwidth parameter , and simulate independent trials for each RNN. Results are detailed in Table 4. Again, we define .
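Table 4: Total memory of RNN hidden representations of white noise input (standard errors in parentheses).

Model | Norm. total memory | p-value | Reject
--- | --- | --- | ---
LSTM (trained) | (0.00405) | 0.583 | ✗
LSTM (untrained) | (0.00223) | 0.572 | ✗
Memory cell | (0.00452) | 0.552 | ✗
SCRN | 0.00237 (0.00522) | 0.324 | ✗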
Discussion.
We summarize the main experimental result of the paper as follows: there is a statistically well-defined and practically identifiable component of the data, relevant for prediction and broadly represented in language and music sequence data, that is not successfully modeled by a collection of RNNs trained to benchmark performance.
Tables 3 and 4 show that each evaluated RNN fails both criteria for representation of the long-range dependency structure of the data on which it was trained. First, RNN hidden representations of simulated input with the same long-range dependency structure as the Penn TreeBank training data show significant residual long memory, indicating that the model does not adequately account for this property and implicating suboptimal predictive performance of the linearly transformed output. Second, hidden representations of white noise input do not have total memory significantly different from zero, echoing our result for nonlinear autoregressions indicating that nonlinearity of the state transition is not sufficient for modeling complex dependency structures. The result holds despite a training protocol that reproduces benchmark performance, and across multiple RNN architectures specifically engineered to alleviate the gradient issues typically implicated in the learning of long-range dependencies.
5 Conclusion
We have introduced and demonstrated a framework for the evaluation of long memory in RNNs that proceeds from a simple but mathematically precise definition. Under this definition, long memory is the condition enabling meaningful autocovariance at very long lags. Of course, for sufficiently complex processes, this will not be sufficient to fully characterize the long-range dependence structure. Nonetheless, it represents a practical and informative foundation upon which to develop a statistical toolkit for estimation, inference, and hypothesis testing, which goes significantly beyond the current paradigm of heuristic checks.
The long memory framework makes possible a formal investigation of specific and quantitative hypotheses concerning the fundamental issue of long-range dependencies in deep sequence learning. Our experiments investigate this phenomenon in natural language and music data, and in the learned representations of RNNs themselves. We have proposed and validated the total memory statistic as an interpretable quantity that naturally avoids the challenges associated with high-dimensional testing. The experimental results suggest that while long memory is a ubiquitous feature of natural language and music data, benchmark recurrent neural network models designed to capture this phenomenon in fact fail to do so. Finally, this work suggests future topics in both time series, particularly concerning long memory analysis in high dimensions, and in deep learning, as a challenge to learn long memory representations in RNNs.
Acknowledgments
This work was supported by the Big Data for Genomics and Neuroscience Training Grant 8T32LM012419, NSF TRIPODS Award CCF1740551, the program “Learning in Machines and Brains” of CIFAR, and faculty research awards.
References
 Bengio and Frasconi (1994) Y. Bengio and P. Frasconi. Credit assignment through time: Alternatives to backpropagation. In Advances in Neural Information Processing Systems, pages 75–82, 1994.
 Bengio et al. (1994) Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
 Beran et al. (2013) J. Beran, Y. Feng, S. Ghosh, and R. Kulik. Long-Memory Processes: Probabilistic Properties and Statistical Methods. Springer, 2013.
 Bietti and Mairal (2019) A. Bietti and J. Mairal. Group invariance, stability to deformations, and complexity of deep convolutional representations. The Journal of Machine Learning Research, 20(1):876–924, 2019.
 Bradley et al. (2005) R. C. Bradley et al. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107–144, 2005.
 Brockwell and Davis (2013) P. J. Brockwell and R. A. Davis. Time series: theory and methods. Springer Science & Business Media, 2013.
 Brodsky and Hurvich (1999) J. Brodsky and C. M. Hurvich. Multi-step forecasting for long-memory processes. Journal of Forecasting, 18(1):59–75, 1999.
 Byrd et al. (1995) R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
 Cho et al. (2014) K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
 Geweke and Porter-Hudak (1983) J. Geweke and S. Porter-Hudak. The estimation and application of long memory time series models. Journal of Time Series Analysis, 4(4):221–238, 1983.
 Granger and Joyeux (1980) C. W. Granger and R. Joyeux. An introduction to long-memory time series models and fractional differencing. Journal of Time Series Analysis, 1(1):15–29, 1980.
 Hannan (2009) E. J. Hannan. Multiple Time Series, volume 38. John Wiley & Sons, 2009.
 Hochreiter and Schmidhuber (1997a) S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997a.
 Hochreiter and Schmidhuber (1997b) S. Hochreiter and J. Schmidhuber. LSTM can solve hard long time lag problems. In Advances in Neural Information Processing Systems, pages 473–479, 1997b.
 Hosking (1981) J. R. Hosking. Fractional differencing. Biometrika, 68(1):165–176, 1981.
 Hurvich and Chen (2000) C. M. Hurvich and W. W. Chen. An efficient taper for potentially overdifferenced long-memory time series. Journal of Time Series Analysis, 21(2):155–180, 2000.
 Ibragimov and Linnik (1965) I. Ibragimov and Y. V. Linnik. Independent and stationary dependent variables, 1965.
 Jones et al. (2019) C. Jones, V. Roulet, and Z. Harchaoui. Kernel-based translations of convolutional networks. arXiv preprint arXiv:1903.08131, 2019.
 Lei et al. (2017) T. Lei, W. Jin, R. Barzilay, and T. Jaakkola. Deriving neural architectures from sequence and graph kernels. In International Conference on Machine Learning, pages 2024–2033, 2017.
 Levy et al. (2018) O. Levy, K. Lee, N. FitzGerald, and L. Zettlemoyer. Long short-term memory as a dynamically computed element-wise weighted sum. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 732–739, 2018.
 Lin et al. (1996) T. Lin, B. G. Horne, P. Tino, and C. L. Giles. Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7(6):1329–1338, 1996.
 Logan et al. (2000) B. Logan et al. Mel frequency cepstral coefficients for music modeling. In ISMIR, volume 270, pages 1–11, 2000.
 Mairal et al. (2014) J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In Advances in Neural Information Processing Systems, pages 2627–2635, 2014.
 Mallat (2016) S. Mallat. Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150203, 2016.
 Mandelbrot and Van Ness (1968) B. B. Mandelbrot and J. W. Van Ness. Fractional Brownian motions, fractional noises and applications. SIAM review, 10(4):422–437, 1968.
 Marcus et al. (1993) M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational linguistics, 19(2):313–330, 1993.
 Meyn and Tweedie (2012) S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Springer Science & Business Media, 2012.
 Mikolov et al. (2015) T. Mikolov, A. Joulin, S. Chopra, M. Mathieu, and M. Ranzato. Learning longer memory in recurrent neural networks. In International Conference on Learning Representations, 2015.
 Moulines et al. (2008) E. Moulines, F. Roueff, M. S. Taqqu, et al. A wavelet Whittle estimator of the memory parameter of a nonstationary Gaussian time series. The Annals of Statistics, 36(4):1925–1956, 2008.
 Pennington et al. (2014) J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
 Percival and Guttorp (1994) D. B. Percival and P. Guttorp. Long-memory processes, the Allan variance and wavelets. In Wavelet Analysis and its Applications, volume 4, pages 325–344. Elsevier, 1994.
 Raftery (1985) A. E. Raftery. A model for high-order Markov chains. Journal of the Royal Statistical Society. Series B (Methodological), pages 528–539, 1985.
 Reisen et al. (2017) V. A. Reisen, C. Lévy-Leduc, and M. S. Taqqu. An M-estimator for the long-memory parameter. Journal of Statistical Planning and Inference, 187:44–55, 2017.
 Robinson et al. (1995) P. M. Robinson et al. Gaussian semiparametric estimation of long range dependence. The Annals of Statistics, 23(5):1630–1661, 1995.
 Shimotsu (2007) K. Shimotsu. Gaussian semiparametric estimation of multivariate fractionally integrated processes. Journal of Econometrics, 137(2):277–310, 2007.
 Thickstun et al. (2017) J. Thickstun, Z. Harchaoui, and S. Kakade. Learning features of music from scratch. In International Conference on Learning Representations, 2017.
 Thickstun et al. (2018) J. Thickstun, Z. Harchaoui, D. P. Foster, and S. M. Kakade. Invariances and data augmentation for supervised music transcription. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2241–2245. IEEE, 2018.
 Tjøstheim (1990) D. Tjøstheim. Nonlinear time series and Markov chains. Advances in Applied Probability, 22(3):587–611, 1990.
 Weston et al. (2016) J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer, A. Joulin, and T. Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. In International Conference on Learning Representations, 2016.
 Whittle (1953) P. Whittle. Estimation and information in stationary time series. Arkiv för matematik, 2(5):423–434, 1953.
 Zaremba et al. (2014) W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. CoRR abs/1409.2329, 2014.
Appendix
Appendix A Short memory of common time series models
We first note that in order to show that the class of processes described by a given time series model has short memory, it is sufficient to show
(13) $$\sum_{k=-\infty}^{\infty} |\gamma(k)| < \infty$$
for each process belonging to the parametric family, where $\gamma(k)$ denotes the autocovariance at lag $k$. The property is implied by both the frequency and time domain definitions of long memory for a scalar process, which are themselves equivalent under the condition that the slowly varying part of the spectral density near zero is quasi-monotone. Therefore, establishing (13) for a given class of models implies that they do not satisfy the definition of a long memory process.
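As a numerical illustration of the summability criterion, the following minimal sketch contrasts partial sums of the absolute autocovariance for a geometrically decaying sequence with those of fractionally integrated noise, ARFIMA(0, d, 0), whose autocovariance has a standard closed form. The function names, the decay rate 0.8, and the memory parameter d = 0.3 are illustrative assumptions and are not taken from the experiments in this paper.

```python
import math

def geometric_acov(k, rho=0.8):
    """Short memory example (assumed): gamma(k) = rho**|k|."""
    return rho ** abs(k)

def fractional_noise_acov(k, d=0.3):
    """ARFIMA(0, d, 0) autocovariance with unit innovation variance, 0 < d < 1/2.

    gamma(k) = Gamma(1-2d) Gamma(k+d) / (Gamma(d) Gamma(1-d) Gamma(k+1-d)),
    evaluated on the log scale for numerical stability.
    """
    k = abs(k)
    log_val = (math.lgamma(1 - 2 * d) + math.lgamma(k + d)
               - math.lgamma(d) - math.lgamma(1 - d) - math.lgamma(k + 1 - d))
    return math.exp(log_val)

def partial_abs_sum(acov, n):
    """Sum of |gamma(k)| over k = -n, ..., n, using symmetry in k."""
    return abs(acov(0)) + 2 * sum(abs(acov(k)) for k in range(1, n + 1))

# Partial sums at two truncation points for each sequence.
short_sums = [partial_abs_sum(geometric_acov, n) for n in (100, 1000)]
long_sums = [partial_abs_sum(fractional_noise_acov, n) for n in (100, 1000)]
print("geometric:", short_sums)
print("fractional noise:", long_sums)
```

The geometric partial sums stabilize almost immediately, while the fractional noise partial sums keep growing with the truncation point, which is the behavior that the absolute summability condition rules out.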
Proposition A.1.
Let $\{X_t\}$ be an irreducible and aperiodic Markov chain on a finite state space $\mathcal{X}$ such that its corresponding transition matrix $P$ has distinct eigenvalues. Let $g: \mathcal{X} \to \mathbb{R}$, and define $Y_t = g(X_t)$. Then $\{Y_t\}$ is a short memory process.
Proof.
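Computation of the autocovariance for a finite state Markov model is classical, but we include it here for completeness. Let $\{X_t\}$ be an irreducible and aperiodic Markov chain on the finite space $\mathcal{X} = \{1, \dots, m\}$, and suppose that the transition matrix $P$ (where $P_{ij} = \mathbb{P}(X_{t+1} = j \mid X_t = i)$) has distinct eigenvalues. Then $\{X_t\}$ has a unique stationary distribution, and we denote its elements $\pi_1, \dots, \pi_m$.
Let $X_0$ have the distribution $\pi$, and define $Y_t = g(X_t)$ for $t \ge 0$ and some $g: \mathcal{X} \to \mathbb{R}$. Note that $\{Y_t\}$ is stationary since $\{X_t\}$ is stationary. We will show that the scalar process $\{Y_t\}$ has short memory.
Write the autocovariance
$$\gamma(k) = \mathrm{Cov}(Y_t, Y_{t+k}) = \sum_{i=1}^{m} \sum_{j=1}^{m} g(i)\, g(j)\, \pi_i\, (P^k)_{ij} - \mu^2,$$
where $\mu = \mathbb{E}[Y_t] = \sum_{i=1}^{m} g(i)\, \pi_i$.
Since $P$ has distinct eigenvalues, it is similar to a diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_m)$:
$$P = U \Lambda V, \qquad V = U^{-1},$$
so that
$$P^k = U \Lambda^k V = \sum_{l=1}^{m} \lambda_l^k\, u_l v_l^\top,$$
where $u_l$ and $v_l$ denote the $l$th row of $U^\top$ and $V$, respectively, written as column vectors, and $v_l^\top$ denotes the transpose of $v_l$. Furthermore, from the existence of the unique stationary distribution we have that
$$\pi^\top P = \pi^\top,$$
so that $\lambda_1 = 1$ with left eigenvector $v_1 = \pi$, and since $P$ is a stochastic matrix, the corresponding right eigenvector is $u_1 = \mathbf{1}$.
Thus
$$P^k = \mathbf{1} \pi^\top + \sum_{l=2}^{m} \lambda_l^k\, u_l v_l^\top.$$
Then we can write
$$\gamma(k) = \sum_{i=1}^{m} \sum_{j=1}^{m} g(i)\, g(j)\, \pi_i \left[ (P^k)_{ij} - \pi_j \right] = \sum_{l=2}^{m} \lambda_l^k \Big( \sum_{i=1}^{m} g(i)\, \pi_i\, u_{l,i} \Big) \Big( \sum_{j=1}^{m} g(j)\, v_{l,j} \Big),$$
and since the chain is irreducible and aperiodic and the $\lambda_l$'s are distinct, $|\lambda_l| < 1$ for $l \ge 2$. Therefore,
$$\left| (P^k)_{ij} - \pi_j \right| \le C \rho^k$$
for some $C < \infty$, with $\rho = \max_{l \ge 2} |\lambda_l| < 1$, which from above implies that
$$|\gamma(k)| \le C \Big( \sum_{i=1}^{m} |g(i)|\, \pi_i \Big) \Big( \sum_{j=1}^{m} |g(j)| \Big) \rho^{k}, \qquad k \ge 0.$$
The absolute convergence of the autocovariance series then follows by comparison to the dominating geometric series. ∎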
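The geometric decay in Proposition A.1 can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not part of the original proof: it uses a two-state chain with hypothetical transition probabilities a = 0.3 and b = 0.1, for which the second eigenvalue of the transition matrix is 1 - a - b and the autocovariance of the observed process is exactly the lag-zero variance times that eigenvalue raised to the lag.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]

def markov_acov(P, pi, g, max_lag):
    """gamma(k) = sum_{i,j} g[i] g[j] pi[i] (P^k)[i][j] - mu^2 for k = 0..max_lag."""
    n = len(P)
    mu = sum(g[i] * pi[i] for i in range(n))
    Pk = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # P^0
    out = []
    for _ in range(max_lag + 1):
        second_moment = sum(g[i] * g[j] * pi[i] * Pk[i][j]
                            for i in range(n) for j in range(n))
        out.append(second_moment - mu ** 2)
        Pk = mat_mul(Pk, P)
    return out

# Hypothetical two-state chain (values chosen for illustration only).
a, b = 0.3, 0.1
P = [[1 - a, a], [b, 1 - b]]
pi = [b / (a + b), a / (a + b)]   # stationary distribution of the two-state chain
g = [0.0, 1.0]                    # observation function Y_t = g(X_t)
lam2 = 1 - a - b                  # second eigenvalue of P

gammas = markov_acov(P, pi, g, max_lag=30)
for k in (0, 1, 5, 10):
    print(k, gammas[k], gammas[0] * lam2 ** k)
```

For chains with more states the autocovariance is a sum of several geometric terms, one per non-unit eigenvalue, so the same computation can be compared against the bound in the proof rather than an exact formula.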
Furthermore, as we next show, neither extending the Markov chain to higher (finite) order nor taking (finite) mixtures of Markov chains is sufficient to obtain a long memory process. We provide a novel proof that the mixture transition distribution (MTD) model (Raftery, 1985) for high-order Markov chains defines a short memory process under conditions similar to those of the proof above.
Proposition A.2.
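Let $\{X_t\}$ be an order-$p$ Markov chain whose transition tensor is parameterized by the MTD model
(14) $$\mathbb{P}(X_t = i_0 \mid X_{t-1} = i_1, \dots, X_{t-p} = i_p) = \sum_{l=1}^{p} \lambda_l\, Q^{(l)}_{i_0 i_l},$$
where each $Q^{(l)}$ is a column-stochastic matrix, $\lambda_l \ge 0$ for each $l$, and $\sum_{l=1}^{p} \lambda_l = 1$. Suppose that the state space $\mathcal{X}$ is finite with $|\mathcal{X}| = m$, and we define $Y_t = g(X_t)$ for some $g: \mathcal{X} \to \mathbb{R}$. Then $\{Y_t\}$ is a short memory process.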
Proof.
In order to write the autocovariance sequence of an MTD process, we must first establish its stationary distribution. Let denote the multivariate Markov transition matrix, which has entries
We make the following assumptions:
1. The multivariate transition matrix has distinct eigenvalues.
2. Each matrix in the MTD parameterization (14) has strictly positive elements on the diagonal.
Each state of is reachable from all others, so is irreducible. The second assumption above shows that the states corresponding to the nonzero diagonal elements of are aperiodic, and thus is aperiodic. The transition matrix therefore specifies an ergodic Markov chain and hence has a unique stationary distribution . We will denote by the univariate marginal of .
Now let be the multivariate stationary distribution of , and let be its univariate marginal. Let have the distribution , and define according to (14) for . Then both and are stationary.
The autocovariance can be written as
where .
Observe that the transition probability can be obtained from the step multivariate transition matrix via
We note that the summation over is precisely the marginalization required to obtain from . Therefore, we can write
However, for each we have
for some by an argument analogous to the Markov chain example. This implies
since is a convex combination of elements obeying the same bound. Therefore, we have
and hence the MTD model has short memory with exponentially decaying autocovariance.
∎
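The construction used in this proof, in which the order-p chain is embedded in a first-order chain on tuples of lagged states, can be sketched numerically for p = 2. The code below is a minimal illustration under assumptions: the matrices `Q1` and `Q2`, the mixture weights, and the helper names are hypothetical, and the stationary distribution is obtained by plain power iteration rather than by any method described in the paper.

```python
from itertools import product

def mtd_pair_chain(Q1, Q2, lam1, lam2):
    """First-order chain on pairs (x_{t-1}, x_{t-2}) induced by an order-2 MTD model.

    Q1 and Q2 are column-stochastic: Q[i0][i1] weights a move to i0 given lagged value i1.
    """
    m = len(Q1)
    states = list(product(range(m), repeat=2))
    index = {s: n for n, s in enumerate(states)}
    T = [[0.0] * len(states) for _ in states]
    for (i1, i2) in states:
        for i0 in range(m):
            # MTD transition probability, then shift the window to (i0, i1).
            T[index[(i1, i2)]][index[(i0, i1)]] += lam1 * Q1[i0][i1] + lam2 * Q2[i0][i2]
    return states, T

def stationary(T, n_iter=2000):
    """Stationary row vector by power iteration (assumes an ergodic chain)."""
    n = len(T)
    v = [1.0 / n] * n
    for _ in range(n_iter):
        v = [sum(v[i] * T[i][j] for i in range(n)) for j in range(n)]
    return v

def mtd_acov(states, T, g, max_lag):
    """Autocovariance of Y_t = g(X_t), where X_t is the first coordinate of the pair state."""
    n = len(T)
    pi = stationary(T)
    vals = [g[s[0]] for s in states]
    mu = sum(vals[i] * pi[i] for i in range(n))
    w = [vals[i] * pi[i] for i in range(n)]  # w[j] = sum_i vals[i] pi[i] (T^k)[i][j]
    out = []
    for _ in range(max_lag + 1):
        out.append(sum(w[j] * vals[j] for j in range(n)) - mu ** 2)
        w = [sum(w[i] * T[i][j] for i in range(n)) for j in range(n)]
    return out

# Hypothetical parameters with strictly positive diagonals (illustration only).
Q1 = [[0.8, 0.3], [0.2, 0.7]]
Q2 = [[0.6, 0.4], [0.4, 0.6]]
states, T = mtd_pair_chain(Q1, Q2, lam1=0.7, lam2=0.3)
gammas = mtd_acov(states, T, g=[0.0, 1.0], max_lag=40)
for k in (0, 1, 5, 10, 20):
    print(k, gammas[k])
```

Summing over the lagged coordinate of the pair state is the marginalization referred to in the proof, and the printed autocovariances shrink rapidly with the lag, consistent with an exponential bound.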
For processes on a real-valued state space, the autoregressive moving average (ARMA) model is a well-known and widely used tool. ARMA models have good approximation properties, as evidenced by the existence of AR and MA orders guaranteeing arbitrarily good approximation to a stationary real-valued stochastic process with continuous spectral density (Brockwell and Davis, 2013). Furthermore, ARMA models with nontrivial moving average components are equivalent to autoregressive models of infinite order, suggesting that these models can integrate information over long histories. However, despite these appealing properties, this class of models cannot represent statistical long memory.
Proposition A.3.
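Define the ARMA($p$, $q$) process $\{X_t\}$ by
$$\phi(B) X_t = \theta(B) \varepsilon_t, \qquad \phi(z) = 1 - \phi_1 z - \dots - \phi_p z^p, \quad \theta(z) = 1 + \theta_1 z + \dots + \theta_q z^q,$$
where $B$ is the backshift operator, $\{\varepsilon_t\}$ is a white noise process with variance $\sigma^2$ and $\phi(z) \ne 0$ for all $z \in \mathbb{C}$ such that $|z| \le 1$. Then $\{X_t\}$ is a short memory process.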
Proof.
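As in the Markov chain case, the proof is classical but included for completeness. Let $\{X_t\}$ be defined as in the statement above. Then $X_t$ has the representation
$$X_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j},$$
where the coefficients $\psi_j$ are given by
$$\psi(z) = \sum_{j=0}^{\infty} \psi_j z^j = \frac{\theta(z)}{\phi(z)},$$
with the above series absolutely convergent on $|z| < 1 + \delta$ for some $\delta > 0$ (cf. Brockwell and Davis (2013), Chapter 3).
Absolute convergence implies that there exists some $r \in (1, 1 + \delta)$ and $K < \infty$ such that
$$|\psi_j|\, r^{j} \le K \quad \text{for all } j \ge 0,$$
so that there exists an $s = 1/r \in (0, 1)$ for which
$$|\psi_j| \le K s^{j}.$$
The autocovariance can be expressed as
$$\gamma(k) = \sigma^2 \sum_{j=0}^{\infty} \psi_j \psi_{j+|k|},$$
and thus we can write
$$|\gamma(k)| \le \sigma^2 K^2 s^{|k|} \sum_{j=0}^{\infty} s^{2j} = \frac{\sigma^2 K^2}{1 - s^2}\, s^{|k|}$$
for $k \in \mathbb{Z}$.
Therefore, as with the Markov models, the autocovariance sequence of an ARMA process is not only absolutely summable but also dominated by an exponentially decaying sequence. ∎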
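The steps of this argument can be traced numerically. The sketch below computes the coefficients of the moving average representation by the standard power series recursion and then a truncated autocovariance, for a hypothetical ARMA(1, 1) model with AR coefficient 0.5 and MA coefficient 0.4. These parameter values, the function names, and the truncation length are assumptions made for illustration.

```python
def psi_weights(phi, theta, n):
    """First n coefficients of psi(z) = theta(z) / phi(z).

    Convention: X_t - sum_i phi[i-1] X_{t-i} = eps_t + sum_j theta[j-1] eps_{t-j}.
    Recursion: psi_j = theta_j + sum_{i=1}^{min(j, p)} phi_i psi_{j-i}, with theta_0 = 1.
    """
    psi = []
    for j in range(n):
        if j == 0:
            val = 1.0
        elif j <= len(theta):
            val = theta[j - 1]
        else:
            val = 0.0
        for i in range(1, min(j, len(phi)) + 1):
            val += phi[i - 1] * psi[j - i]
        psi.append(val)
    return psi

def arma_acov(phi, theta, sigma2, max_lag, n_terms=500):
    """Truncated gamma(k) = sigma2 * sum_j psi_j psi_{j+k} for k = 0..max_lag."""
    psi = psi_weights(phi, theta, n_terms + max_lag)
    return [sigma2 * sum(psi[j] * psi[j + k] for j in range(n_terms))
            for k in range(max_lag + 1)]

# Hypothetical ARMA(1, 1) example (illustration only).
gammas = arma_acov(phi=[0.5], theta=[0.4], sigma2=1.0, max_lag=20)
ratios = [gammas[k] / gammas[k - 1] for k in range(2, 21)]
print(gammas[:3])
print(ratios[:3])
```

For an ARMA(1, 1) model the ratio of successive autocovariances beyond lag one equals the AR coefficient, so the sequence is dominated by a geometric sequence exactly as in the bound above.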
Finally, we show that in general nonlinear state transitions are not sufficient to induce long range dependence, a point particularly relevant to the analysis of long memory in RNNs.
Proposition A.4.
Define the scalar nonlinear autoregressive process
where is a white noise sequence with positive density with respect to Lebesgue measure and satisfying , while is bounded on compact sets and satisfies
for some . Then has a unique stationary distribution , and the sequence of random variables initialized with is strictly stationary and geometrically ergodic.
Furthermore, if
for some , then is a short memory process.
Proof.
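The proof proceeds by analysis of $\{X_t\}$ as a Markov chain on a general state space $(\mathbb{R}, \mathcal{B})$, where $\mathcal{B}$ is the standard Borel sigma algebra on the real line. Define the transition kernel $P(x, A) = \mathbb{P}(X_{t+1} \in A \mid X_t = x)$ for any $x \in \mathbb{R}$ and $A \in \mathcal{B}$.
We first establish that $\{X_t\}$ is aperiodic. A $d$-cycle is defined by a collection of disjoint sets $D_1, \dots, D_d \in \mathcal{B}$ such that
1. For $x \in D_i$, $P(x, D_{i+1}) = 1$, $i = 1, \dots, d \pmod{d}$.
2. The set $\big[\bigcup_{i=1}^{d} D_i\big]^c$ has measure zero.
The period is defined as the largest $d$ for which $\{X_t\}$ has a $d$-cycle (Meyn and Tweedie, 2012). Clearly, however, since the noise has positive density with respect to Lebesgue measure, $P(x, D) = 1$ only if $D = \mathbb{R}$ up to null sets. Thus the period is $1$, so $\{X_t\}$ is aperiodic.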
Strict stationarity and geometric ergodicity are established by showing that the aperiodic chain satisfies a strengthened version of the Tweedie criterion (Meyn and Tweedie, 2012), which requires the existence of a measurable nonnegative function , , and such that
for some set satisfying
Under the conditions of and assumed above, this criterion is established for the process in Tjøstheim (1990) (Thm 4.1), with .
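Geometric ergodicity implies that the $k$-step transition kernel converges to the stationary distribution $\pi$ at a geometric rate,
$$\| P^k(x, \cdot) - \pi \|_{TV} \le M(x)\, \rho^k,$$
with $M(x) < \infty$, $0 < \rho < 1$, and where $\| \cdot \|_{TV}$ denotes the total variation distance between measures. A well-known result in the theory of Markov chains (Bradley et al., 2005) establishes that geometric ergodicity is equivalent to absolute regularity, which is parameterized by
$$\beta(k) = \sup \frac{1}{2} \sum_{i=1}^{I} \sum_{j=1}^{J} \left| \mathbb{P}(A_i \cap B_j) - \mathbb{P}(A_i)\, \mathbb{P}(B_j) \right|,$$
where the supremum is taken over all finite partitions $\{A_1, \dots, A_I\}$ and $\{B_1, \dots, B_J\}$ of the sample space with $A_i \in \sigma(X_s : s \le t)$ and $B_j \in \sigma(X_s : s \ge t + k)$. In particular, $\beta(k)$ decays at least exponentially fast.
Furthermore, for any two sigma fields $\mathcal{F}$ and $\mathcal{G}$ we have
$$\alpha(\mathcal{F}, \mathcal{G}) = \sup_{A \in \mathcal{F},\, B \in \mathcal{G}} \left| \mathbb{P}(A \cap B) - \mathbb{P}(A)\, \mathbb{P}(B) \right| \le \frac{1}{2}\, \beta(\mathcal{F}, \mathcal{G}),$$
so that the mixing parameter $\alpha(k)$ is also bounded by an exponentially decaying sequence.
Finally, if $\mathbb{E}|X_t|^{2+\delta} < \infty$ for some $\delta > 0$, then the absolute covariance obeys (Ibragimov and Linnik (1965), Thm. 17.2.2)
$$|\gamma(k)| \le C\, \alpha(k)^{\delta / (2 + \delta)}$$
for a constant $C < \infty$ depending only on $\delta$ and $\mathbb{E}|X_t|^{2+\delta}$, which completes the proof. ∎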
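A simulation can make the conclusion of Proposition A.4 concrete. The following sketch is an illustration under assumptions rather than an experiment from this paper: it takes a first-order recursion in which the next value is a bounded nonlinear function of the current value plus standard Gaussian noise, with the hypothetical choice f(x) = 0.8 tanh(x), and estimates the sample autocorrelation at a few lags. The function names, sample size, and seed are likewise illustrative.

```python
import math
import random

def simulate_nar(f, n, burn_in=1000, seed=0):
    """Simulate x_t = f(x_{t-1}) + eps_t with standard Gaussian noise, discarding a burn-in."""
    rng = random.Random(seed)
    x = 0.0
    path = []
    for t in range(n + burn_in):
        x = f(x) + rng.gauss(0.0, 1.0)
        if t >= burn_in:
            path.append(x)
    return path

def sample_autocorr(x, lags):
    """Sample autocorrelation at the requested lags (biased estimator, divisor n)."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n
    out = {}
    for k in lags:
        ck = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / n
        out[k] = ck / c0
    return out

# Hypothetical tanh state transition, loosely reminiscent of a one-unit RNN.
path = simulate_nar(lambda x: 0.8 * math.tanh(x), n=100_000)
rho = sample_autocorr(path, lags=[1, 5, 10, 30])
for k, r in rho.items():
    print(k, round(r, 4))
```

The estimated autocorrelation is clearly positive at lag one and indistinguishable from sampling noise at long lags, in line with the claim that a nonlinear state transition alone does not produce long-range dependence.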
Appendix B Gradient of the GSE objective
Recall that the objective function is given by
with
The derivative with respect to the element