Variance Estimation For Dynamic Regression via Spectrum Thresholding
We consider the dynamic linear regression problem, where the predictor vector may vary with time. This problem can be modeled as a linear dynamical system, where the parameters that need to be learned are the variances of both the process noise and the observation noise. While variance estimation for dynamic regression is a natural problem, with a variety of applications, existing approaches either lack guarantees or only have asymptotic guarantees without explicit rates. In addition, all existing approaches rely strongly on Gaussianity of the noises. In this paper we study the global system operator: the operator that maps the noise vectors to the output. In particular, we obtain estimates on its spectrum, and as a result derive the first known variance estimators with finite sample complexity guarantees. Moreover, our results hold for arbitrary sub Gaussian distributions of the noise terms. We evaluate the approach on synthetic and real-world benchmarks.
A dynamic linear regression (West and Harrison, 1997, Chapter 3), or non-stationary regression, is a setting in which we are given a sequence of scalar observations and observation vectors such that , where is a regressor vector and a random noise term. In contrast to standard linear regression, the vector may change with time. One common objective for this problem is, at time , to estimate the trajectory of for , given the observation vectors and observations, and possibly to forecast if is also known.
In this paper we model the problem as follows:
where is the standard inner product on , , the observation noise, are zero-mean sub Gaussian random variables, with variance , and the process noise variables take values in , such that the coordinates of are sub Gaussian, independent, and have variance . All and variables are assumed to be mutually independent. The vectors are an arbitrary sequence in , and the observed, known quantities at time are and .
The system (1)-(2) is a special case of a Linear Dynamical System (LDS). As is well known, when the parameters are given, the mean-squared loss optimal forecast for and estimate for are obtained by the Kalman Filter (Anderson and Moore, 1979; Hamilton, 1994; Chui and Chen, 2017). In this paper we are concerned with estimators for , and sample complexity guarantees for these estimators.
Let us first make a few remarks about the particular system (1)-(2). First, as a natural model of time varying regression, this system is useful in a considerable variety of applications. We refer to West and Harrison (1997), Chapter 3, for numerous examples. In addition, an application to electricity consumption time series as a function of the temperature is provided in the experiments section of this paper. Second, one may regard the problem of estimating in (1)-(2) as a pure case of finding the optimal learning rate for . Indeed, the Kalman filter equations for (1)-(2), are given by (3)-(4) below, where (3) describes the filtered covariance update and (4) the filtered state update. Here is the estimated state, given the observations , see West and Harrison (1997).
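For concreteness, one step of the Kalman filter (3)-(4), specialized to the random-walk regression model (1)-(2), can be sketched as follows. This is the standard filter written in our own (illustrative) variable names, not the paper's code:

```python
import numpy as np

def kalman_step(theta, P, x, y, var_nu, var_eps):
    """One Kalman filter step for the random-walk regression model
    theta_t = theta_{t-1} + nu_t,  y_t = <x_t, theta_t> + eps_t.

    Standard filter specialized to system (1)-(2); variable names are ours.
    """
    P_pred = P + var_nu * np.eye(len(theta))   # predicted covariance, cf. eq. (3)
    s = x @ P_pred @ x + var_eps               # innovation variance
    K = P_pred @ x / s                         # Kalman gain
    theta_new = theta + K * (y - x @ theta)    # state update, cf. eq. (4)
    P_new = P_pred - np.outer(K, x) @ P_pred   # filtered covariance
    return theta_new, P_new
```

Note how the gain `K` grows with `var_nu` and shrinks with `var_eps`, which is exactly the learning-rate interpretation discussed below.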
In particular, following (4), the role of and may be interpreted as regulating how much the estimate of is influenced, via the operator , by the most recent observation and input . Roughly speaking, higher values of or lower values of would imply that the past observations are given less weight, and result in an overfit of the forecast to the most recent observation (an illustration is given in Figure 3, Appendix A). On the other hand, very low or high would make the problem similar to the standard linear regression, where all observations are given equal weight, and result in a lag of the forecast. Finally, it is worth mentioning that the system (1)-(2) is closely related to the study of online gradient (OG) methods (Zinkevich, 2003; Hazan, 2016). In this field, assuming quadratic cost, one considers the update
where is the learning rate, and studies the performance guarantees of the forecaster . Compared to (4), the update (5) is simpler, and uses a scalar rate instead of the input-dependent operator rate of the Kalman filter. However, due to the similarity, every domain of applicability of the OG methods is also a natural candidate for the model (1)-(2) and vice-versa. As an illustration, we compare the OG to Kalman filter based methods with learned , in the experiments section.
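A one-step sketch of the online gradient update (5) for the quadratic cost; the names `w`, `x`, `y`, `eta` are ours, not the paper's notation:

```python
import numpy as np

def og_step(w, x, y, eta):
    """One online-gradient step for the quadratic cost 0.5 * (y - <w, x>)^2.

    The gradient with respect to w is -(y - w @ x) * x, so the step
    moves w by eta times the residual in the direction of x.
    """
    residual = y - w @ x
    return w + eta * residual * x
```

Compared to the Kalman update, the scalar rate `eta` plays the role of the input-dependent operator gain.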
In this paper we introduce a new estimation algorithm for , termed STVE (Spectrum Thresholding Variance Estimator), and prove finite sample complexity bounds for it. In particular, our bounds are an explicit function of the parameters and for any finite , and indicate that the estimation error decays roughly as , with high probability. To the best of our knowledge, these are the first bounds of this kind. As we discuss in detail in Section 2, most existing estimation methods for LDSs, such as subspace identification (van Overschee and de Moor, 1996), or improper learning (Anava et al., 2013; Hazan et al., 2017; Kozdoba et al., 2019), do not apply to the system (1)-(2), due to non-stationarity. On the other hand, the methods that do apply to (1)-(2) lack guarantees and rely strongly on Gaussianity of the noises.
Moreover, our approach differs significantly from existing methods. We show that the structure of equations (1)-(2) is closely related to, and inherits several important properties from, the classical discrete Laplacian operator on the line, a connection that has not previously been explored in the literature. In particular, we use this connection to show that appropriate inversions of the system produce estimators that are concentrated enough that and may be recovered.
The rest of the paper is organized as follows: related work is discussed in Section 2. In Section 3 we give an overview of the methods and main results of this paper. The main technical statements are given in Section 4. In Section 5 we present experimental results on synthetic and real data, and conclusions and future work are discussed in Section 6. Due to space constraints, while we outline the main arguments in the text, the full proofs are deferred to the Appendix.
Existing approaches to the variance estimation problem may be divided into three categories: (i) General methods for parameter identification in LDS, either via maximum likelihood estimation (MLE) (Hamilton, 1994), or via subspace identification (van Overschee and de Moor, 1996). (ii) Methods designed specifically to learn the noise parameters of the system, developed primarily in the control theory community, in particular via the innovation auto-correlation function, such as the classical Mehra (1970) and Belanger (1974), or the more recent Wang et al. (2017) and Dunik et al. (2018). (iii) Improper learning methods, such as Anava et al. (2013); Hazan et al. (2017); Kozdoba et al. (2019). In these approaches, one does not learn the LDS directly, but instead learns a model from a certain auxiliary class and shows that this auxiliary model produces forecasts that are as good as the forecasts of an LDS with “optimal” parameters.
Despite the apparent simplicity of the system (1)-(2), most of the above methods do not apply to it. This is because most of the methods are designed for time invariant, asymptotically stationary systems, where the observation operator ( in our notation) is constant and the Kalman gain (or, equivalently in eq. (3)) converges with . However, if the observation vector sequence changes with time – a necessary property for the dynamic regression problem – the system will no longer be asymptotically stationary. In particular, for this reason, neither the subspace identification methods nor any of the improper learning approaches above apply to system (1)-(2).
Among the methods that do apply to (1)-(2) are general MLE estimation and some of the auto-correlation methods (Belanger, 1974; Dunik et al., 2018). On one hand, both types of approaches may be applicable to systems a priori more general than (1)-(2). On the other hand, the situation with consistency guarantees – the guarantee that one recovers the true parameters given enough observations – for these methods is somewhat complicated. Due to the non-convexity of the likelihood function, the MLE method is not guaranteed to find the true maximum, and as a result the method as a whole has no guarantees. The results in Belanger (1974); Dunik et al. (2018) do have asymptotic consistency guarantees. However, these rely on explicit and implicit assumptions about the system, the sequence in our case, which cannot be easily verified. In particular, Belanger (1974); Dunik et al. (2018) assume uniform observability of the system, which we do not assume, and in addition rely on certain assumptions about the invertibility and condition number of matrices related to the sequence . Moreover, even if one assumes that these assumptions hold, the results are purely asymptotic, and for any finite , do not provide a bound on the expected estimation error as a function of and .
In addition, as mentioned earlier, MLE methods by definition must assume that the noises are Gaussian (or belong to some other predetermined parametric family), and autocorrelation-based methods also rely strongly on the Gaussianity assumption. Our approach, on the other hand, requires only sub Gaussian noises with independent coordinates. It is an interesting question whether our approach may be extended further, by removing the independent coordinates assumption. The operator analysis part of this paper does not depend on the distribution of the noises. Therefore, to achieve such an extension, one would only need to correspondingly extend our main probabilistic tool, the Hanson-Wright inequality (Hanson and Wright, 1971; Rudelson et al., 2013; see also Section 3 and Appendix E). It is worth noting that such an extension was recently obtained in Hsu et al. (2012), but only for upper deviations, rather than for concentration around the mean. In this form, it is not sufficient for the purposes of this work.
3 Overview of the approach
We begin by rewriting (1)-(2) in a vector form. To this end, we first encode sequences of vectors in , , as a vector , constructed by concatenation of ’s. Next, we define the summation operator which acts on any vector by
Note that is an invertible operator. Next, we similarly define the summation operator , an -dimensional extension of , which sums -dimensional vectors. Formally, for , and for , . Observe that if the sequence of process noise terms is viewed as a vector , then by definition is the encoding of the sequence .
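In matrix form (our illustrative sketch), the summation operator is simply the lower-triangular all-ones matrix, and its inverse is the first-order difference operator, which is the source of the connection to the discrete Laplacian:

```python
import numpy as np

def summation_matrix(n):
    """Matrix form of the summation operator on R^n:
    the t-th output coordinate is v_1 + ... + v_t,
    i.e. a lower-triangular all-ones matrix."""
    return np.tril(np.ones((n, n)))
```

Its inverse has 1 on the diagonal and -1 on the subdiagonal, i.e. it is the discrete derivative.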
Next, given a sequence of observation vectors , we define the observation operator by . In words, coordinate of is the inner product between and -th part of the vector . Define also to be the concatenation of . With this notation, one may equivalently rewrite the system (1)-(2) as follows:
where and are independent zero-mean random vectors in and respectively, with independent sub Gaussian coordinates. The variance of each coordinate of is and each coordinate of has variance .
Up to now, we have reformulated our data model as a single vector equation. Note that in that equation, the observations and both operators and are known to us. Our problem may now be reformulated as follows: Given , assuming was generated by (7), provide estimates of .
As a motivation, we first consider taking the expectation of the norm squared of eq. (7). For any operator and zero-mean vector with independent coordinates and coordinate variance , we have , where is the Hilbert-Schmidt (or Frobenius) norm of . Taking the norm and expectation of (7), and dividing by , we thus obtain
Next, note that is known, and an elementary computation shows that is of constant order (as a function of ; see (24)), while the coefficient of is . Thus, if the quantity were close enough to its expectation with high probability, we could take this quantity as a (slightly biased) estimator of . However, as will become apparent later, the deviations of around the expectation are also of constant order, and thus it cannot be used as an estimator. The reason for these high deviations of is that the spectrum of is extremely peaked. The highest squared singular value of is of order , the same order as the sum of all of them, . Contrast this with the case of the identity operator, : We have , and one can also easily show that, for instance, , and thus the deviations are of order – a smaller order than . While for the identity operator the computation is elementary, for a general operator the situation is significantly more involved, and the bounds on the deviations of will be obtained from the Hanson-Wright inequality (Hanson and Wright, 1971, see also Rudelson et al. (2013)), combined with standard norm deviation bounds for isotropic sub Gaussian vectors.
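A quick numerical illustration (ours; the choice n = 200 is arbitrary) of how peaked the spectrum of the summation matrix is: the largest squared singular value alone accounts for a constant fraction of the total Hilbert-Schmidt norm squared.

```python
import numpy as np

n = 200
S = np.tril(np.ones((n, n)))           # summation operator as a matrix
s = np.linalg.svd(S, compute_uv=False)

top = s[0] ** 2                        # largest squared singular value
total = (s ** 2).sum()                 # equals the HS norm squared, n(n+1)/2

# The ratio stays bounded away from zero as n grows: the spectrum is peaked.
ratio = top / total
```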
With these observations in mind, we proceed to flatten the spectrum of by taking the pseudo-inverse. Let be the pseudo-inverse, or Moore-Penrose inverse of . Specifically, let
be the singular value decomposition of , where are the singular values.
For the rest of the paper, we will assume that all of the observation vectors are non-zero. This assumption is made solely for notational simplicity and may easily be avoided, as discussed later in this section. Under this assumption, since is invertible and has rank , we have for all . For , denote . Then are the singular values of , arranged in a non-increasing order, and we have by definition
In this equation, the deviation term is of order with high probability (Theorem 1). Moreover, the coefficient of is , and the coefficient of , which is , is of order at least (Theorem 7) – much larger order than . Since and are known, it follows that we have obtained one equation satisfied by and up to an error of , where both coefficients are of order larger than the error.
Next, we would like to obtain another linear relation between , . To this end, choose some , where is of constant order. The possible choices of are discussed later in this section. We define an operator to be a version of truncated to the first singular values. If (10) is the SVD decomposition of , then
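A minimal sketch of such a truncation via the SVD (the function name is ours):

```python
import numpy as np

def truncate_to_top_k(A, k):
    """Return the operator A with all but its k largest singular values
    set to zero (a spectrum-thresholded version of A)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s = s.copy()
    s[k:] = 0.0
    return U @ np.diag(s) @ Vt
```

The result has rank at most `k` and agrees with `A` on the top part of the spectrum.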
Similarly to the case for , we have
Now, given two equations in two unknowns, we can solve the system to obtain the estimates and . The full procedure is summarized in Algorithm 1, and the bounds implied by Theorem 1 on the estimators and are given in Corollary 2. We first state these results, and then discuss in detail the various parameters appearing in the bounds.
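The final step, solving the two moment equations for the two unknown variances, is a plain 2-by-2 linear solve; the coefficient names below are our placeholders for the known squared-norm quantities in (11)-(12):

```python
import numpy as np

def solve_two_moment_equations(a1, b1, c1, a2, b2, c2):
    """Solve
        a1 * var_eps + b1 * var_nu = c1
        a2 * var_eps + b2 * var_nu = c2
    for the two variance estimates. Coefficient names are ours."""
    M = np.array([[a1, b1], [a2, b2]], dtype=float)
    c = np.array([c1, c2], dtype=float)
    var_eps, var_nu = np.linalg.solve(M, c)
    return var_eps, var_nu
```

The solve degrades as the two equations become parallel, which is exactly why the two equations must be sufficiently different, i.e. why condition (17) below is needed.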
Consider a random vector of the form
Let be the estimators of obtained from Algorithm 1 with . Set . Then for any , with probability at least ,
We first discuss the assumption . This assumption is made solely for notational convenience, as detailed below. To begin, note that some form of lower bound on the norms of the observation vectors must appear in the bounds. This is simply because if one had for all , then clearly no estimate of would have been possible. On the other hand, our use of the smallest value may seem restrictive at first. We note however, that instead of considering the observation operator , one may consider the operator for any subsequence . The observation vector would be correspondingly restricted to the subsequence of indices. This allows us to treat missing values and to exclude any outliers with small norms. All the arguments in Theorems 1 and 7 hold for this modified without change. The only price that will be paid is that will be replaced by in the bounds. Moreover, we note that typically we have by construction, see for instance Section 5.2. Additional discussion of missing values may be found in Appendix H.
Next, up to this point, we have obtained two equations, (11)-(12), in two unknowns, . Note that in order to be able to obtain from these equations, at least one of the coefficients of , either or must be of larger order than , the order of deviations. Providing lower bounds on these quantities is one of the main technical contributions of this work. This analysis uses the connection between the operator and the Laplacian on the line, and resolves the issue of translating spectrum estimates for the Laplacian into the spectral estimates for . We note that there are no standard tools to study the spectrum of , and our approach proceeds indirectly via the analysis of the nuclear norm of . These results are stated in Theorem 7 below. In particular, we show that is , which is the source of the log factor in (16).
Finally, in order to solve the equations (11)-(12), not only must the equations have large enough coefficients, but they must also be different. This is reflected by the term in (15), (16). Equivalently, while by definition, we would like to have
It is worth emphasizing that for simple choices of , say , the condition (17) does hold in practice. Note that, for any , we can have only if the spectrum of is constant. Thus (17) amounts to stating that the spectrum of exhibits some decay. As we show in experiments below, the spectrum (squared) of , for derived from daily temperature features, or for random Gaussian , indeed decays. See Section 5, Figures 0(a) and 0(b). In particular, in both cases (17) holds with .
While the quantity may simply be taken as a parameter of the problem, which measures the decay of , it is nevertheless desirable to have some simpler bounds on this quantity. Here we provide a partial result in this direction, which bounds under an additional condition on .
Given the sequence , define the scalar sequence and set
Then for ,
The general idea behind the proof of Proposition 3 is to show that for , the ratio can be controlled. This is done in Lemmas 11 and 12 in Appendix G. In particular, Lemma 11 is a general statement about integrals of monotone real functions under certain order constraints, and Lemma 12 provides a relation of the spectrum of to that of . It is then shown that for arbitrary , contains a certain copy of an -dimensional operator with parameters , which implies the bounds.
4 Properties of and
In this section we introduce the necessary notation, and state the key properties of the operators and defined in Section 3. We refer to Bhatia (1997) and Vershynin (2018) as general references on the notation introduced below, for operators and sub Gaussian variables, respectively.
Let be an operator with a singular value decomposition , where and . Note that singular values are strictly positive by definition (that is, vectors corresponding to the kernel of do not participate in the decomposition ). The Hilbert-Schmidt (Frobenius), nuclear and operator norms are defined, respectively, as
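These three norms can be computed directly from the singular values; a minimal sketch (the function name is ours):

```python
import numpy as np

def hs_nuclear_op_norms(A):
    """Hilbert-Schmidt (Frobenius), nuclear, and operator norms of A,
    computed from its singular values: the l2, l1, and l-infinity norms
    of the singular value sequence, respectively."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sqrt((s ** 2).sum()), s.sum(), s.max()
```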
A centered () random variable is sub-Gaussian with constant , denoted , if for all it satisfies
A random vector is sub-Gaussian, denoted , if for every with the random variable is sub-Gaussian. A random vector is -isotropic if for every with , .
Finally, a random vector is -isotropically sub-Gaussian with independent components, denoted if are independent, and for all , , and . Clearly, if then is -isotropic. Recall also that implies (Vershynin, 2018). The noise variables we discuss in this paper are .
Finally, throughout the paper, absolute constants are denoted by , etc.; their values may change from line to line.
The main result of this section is Theorem 7. To prove it we require several lemmas. First, we obtain the following bounds on the spectrum of .
The singular values of satisfy the following:
The proof of this lemma uses the following auxiliary result: Let be the difference operator, . In the field of finite difference methods, the operator is known as the Laplacian on the line, or as the discrete derivative with Dirichlet boundary conditions, and is well studied. The eigenvalues of may be derived by a direct computation, and correspond to the roots of the Chebyshev polynomial of the second kind of order . In particular, the following holds:
The operator has kernel of dimension and singular values for .
We refer to Mitchell and Griffiths (1980) for the proof of Lemma 5. Next, in Lemma 6, we show that the inverse of the operator , defined in (6), is a one dimensional perturbation of , which implies bounds on singular values of .
The singular values of satisfy
for , and
Finally, as discussed in Section 3, the following bounds play an essential role in the analysis of the variance estimators:
Let be the pseudoinverse of . Then
In this section we empirically evaluate the STVE algorithm. In the Synthetic Data section we demonstrate the method on a simulated system with Gaussian random inputs . We verify in particular that the estimation error decays roughly at the rate of as grows, as predicted by the theory. In Temperatures and Electricity Consumption section we study the relation between daily temperatures and electricity consumption via the dynamic regression model. We compare the following methods: a stationary regression, an online gradient, and a Kalman filter for a dynamic regression, with parameters learned via MLE or STVE. We find that the Kalman filter methods provide the best performance, with no significant difference between STVE and MLE derived systems.
5.1 Synthetic Data
We generated synthetic data by the LDS (1)-(2) where . The values of were sampled from . We then ran the STVE algorithm for different values of , where for each we sampled the data times. Figure 0(c) shows the average (over 150 runs) estimation error for both the process and observation noise variances for various values of . As expected from the bounds in Corollary 2, it may be observed in Figure 0(c) that the estimation errors decay roughly at the rate of . A typical spectrum of is shown in Figure 0(b). For larger , the spectra exhibit a similar decay.
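A hedged sketch of this synthetic setup (the exact dimensions and variance values used in the experiment are not reproduced here; the function and variable names are ours):

```python
import numpy as np

def simulate_dynamic_regression(n, d, var_nu, var_eps, seed=0):
    """Sample a trajectory from the system (1)-(2) with Gaussian noises
    and i.i.d. standard Gaussian observation vectors x_t."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    X = np.empty((n, d))
    y = np.empty(n)
    for t in range(n):
        theta = theta + np.sqrt(var_nu) * rng.standard_normal(d)        # eq. (1)
        X[t] = rng.standard_normal(d)
        y[t] = X[t] @ theta + np.sqrt(var_eps) * rng.standard_normal()  # eq. (2)
    return X, y
```

Running STVE on such samples for a grid of `n` values and averaging the errors over repeated draws reproduces the experiment's structure.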
5.2 Temperatures and Electricity Consumption
In this section we examine the relation between daily temperatures and electricity consumption in the data from Hong et al. (2014) (see also Hong (2016)). In this data, we have total daily electricity consumption (load) for a certain region, for the period Jan- to Jun-. We also have average daily temperatures, , for the region. We refer to Appendix I for additional details on the data, the preprocessing, and an extended discussion of the analysis.
An elementary inspection of the data reveals that the load may be reasonably approximated as a quadratic function of the temperature,
where is the observation vector (features), and is the possibly time varying regression vector. This is shown in Figure 1(a), where we fit a stationary (time invariant) regression of the form (30), using either only the first or only the second half of the data. We note that these regressions differ between the two halves.
We use the first half of the data (train set) to learn the parameters of the online regression (1)-(2) via MLE optimization and using STVE. We also use the train set to find the optimal learning rate for the OG forecaster described by the update equation (5). This learning rate is chosen as the rate that yields the smallest least-squares forecast error on the train set, and is found via the L-BFGS-B optimization algorithm. In addition, we learn a time independent, stationary regression on the first half of the data.
We then employ the learned parameters to make predictions of the load given the temperature, by all four methods. The predictions for the system (1)-(2) are made with a Kalman filter (at time , we use the filtered state estimate , which depends only on and , and make the prediction ).
Daily squared prediction errors (that is, ) are shown in Figure 1(b) (smoothed with a moving average of 50 days). We see that the adaptive models (MLE, STVE, OG) outperform the stationary regression already on the train set (first half of the data), and that the difference in performance becomes dramatic on the second half (test). It is also interesting to note that the performance of the Kalman filter based methods (MLE, STVE) is practically identical, but both are somewhat better than the simpler OG approach (see also Figure 5 in Appendix I).
Finally, we note that by construction, we have in this experiment, due to the constant coordinate, and also , due to the normalization. Since these operations are typical for any regression problem, we conclude that the direct influence of and on the bounds in Theorem 1 and Corollary 2 will not usually be significant.
6 Conclusion and Future Work
We conclude by mentioning a few extensions and open questions. There are a few extensions that may be obtained by our methods virtually without change. For instance, an extension to vector rather than scalar valued observations is straightforward and can be done along the lines of the current arguments. An extension to control settings, where the equation of the state (1) is , with known inputs is also immediate. Next, it is easy to see that the error rate of the estimators in Corollary 2, , is optimal in . Indeed, even for a single sub Gaussian variable , the error rate of the standard variance estimator is lower-bounded by , by the Central Limit Theorem. However, it is not clear whether the dependence on and the dimension in (15)-(16) is also tight. Finally, a question of independent interest is the question of stability: Given error bounds for , what can be said about the prediction error of the Kalman filter with parameters compared to the error of the filter with true parameters ? To the best of our knowledge, this fundamental question is still open.
Variance Estimation For Dynamic Regression via Spectrum Thresholding - Appendix
Appendix A Outline
This Appendix is organized as follows: Theorems 1, 7, Corollary 2, and all intermediate results are proved in Sections B to F. In Section G we prove Proposition 3. Section H contains the discussion of missing values in STVE. Additional details on the electricity consumption experiment are given in Section I.
Finally, Figure 3 illustrates the behavior of the Kalman filter on fixed data for various values of . The data (observations, red) was generated from the system (1)-(2) with a one dimensional state (), and we set . The states produced by the Kalman filter with the ground truth values of are shown in green, while states obtained from other choices of parameters are shown in black and blue. Clearly, with wrong choices of , the filter can either severely lag behind the signal or overfit to the latest observation.
Appendix B Proof of Lemma 6
Recall that the operator is given by and the operator is given by . Let be the subspace spanned by all but the first coordinate. Let be the projection onto , i.e. the restriction to the second through th coordinates. Observe that the action of is equivalent to that of . Therefore the singular values of are identical to those of . To obtain bounds on the singular values of , note that – that is, is a compression of . Thus, by Cauchy's Interlacing Theorem (Bhatia, 1997, Corollary III.1.5),
for all . In conjunction with Lemma 5 this provides us with estimates for all but the smallest singular value of (since ). We therefore estimate directly, by bounding the norm of . Indeed, for any , by the Cauchy-Schwarz inequality,
Thus we have , which concludes the proof of the Lemma. ∎
Appendix C Proof of Lemma 4
By Lemma 6,
Note also that is by definition a collection of independent copies of , and therefore the spectrum of is that of , but each singular value is taken with multiplicity . In particular it follows that (34) holds also for itself. Since clearly , the upper bound on in (22) follows from (34).
For the lower bound, denote by the orthogonal complement to the kernel of , . Denote by the orthogonal projection onto . We have in particular that . Next, the operator maps the unit ball of into an ellipsoid , and by (34), we have . It therefore follows that
where is the unit ball of . It remains to observe that for every , we have
To derive (23), recall that the nuclear norm is sub-multiplicative with respect to the operator norm:
This follows for instance from the characterization of the nuclear norm as trace dual of the operator norm (Bhatia, 1997, Propositions IV.2.11, IV.2.12). Next, since the spectrum of is the spectrum of taken with multiplicity , we have
and it remains to bound the nuclear norm of .
Using the inequality
and Lemma 6, we have