Robust and Trend-following Student’s t-Kalman Smoothers

Robust and Trend-following Student’s t-Kalman Smoothers

Aleksandr Y. Aravkin. IBM T.J. Watson Research Center, Yorktown Heights, NY, 10598 (    James V. Burke Mathematics Dept., University of Washington, Seattle, WA 98195 (    Gianluigi Pillonetto July 5, 2019 Control and Dynamic Systems Department of Information Engineering at the University of Padova, Padova, Italy (

We present a Kalman smoothing framework based on modeling errors using the heavy tailed Student’s t distribution, along with algorithms, convergence theory, open-source general implementation, and several important applications. The computational effort per iteration grows linearly with the length of the time series, and all smoothers allow nonlinear process and measurement models.

Robust smoothers form an important subclass of smoothers within this framework. These smoothers work in situations where measurements are highly contaminated by noise or include data unexplained by the forward model. Highly robust smoothers are developed by modeling measurement errors using the Student’s t distribution, and outperform the recently proposed -Laplace smoother in extreme situations with data containing 20% or more outliers.

A second special application we consider in detail allows tracking sudden changes in the state. It is developed by modeling process noise using the Student’s t distribution, and the resulting smoother can track sudden changes in the state.

These features can be used separately or in tandem, and we present a general smoother algorithm and open source implementation, together with convergence analysis that covers a wide range of smoothers. A key ingredient of our approach is a technique to deal with the non-convexity of the Student’s t loss function. Numerical results for linear and nonlinear models illustrate the performance of the new smoothers for robust and tracking applications, as well as for mixed problems that have both types of features.

1 Introduction

The Kalman filter is an efficient recursive algorithm for estimating the state of a dynamic system [kalman]. Traditional formulations are based on penalties on model deviations, and are optimal under assumptions of linear dynamics and Gaussian noise. Kalman filters are used in a wide array of applications including navigation, medical technologies, and econometrics [Chui2009, West1991, Spall03]. Many of these problems are nonlinear, and may require smoothing over past data in both online and offline applications to significantly improve estimation performance [Gelb].
This paper focuses on two important areas in Kalman smoothing: robustness with respect to outliers in measurement data, and improved tracking of quickly changing system dynamics. Robust filters and smoothers have been a topic of significant interest since the 1970’s, e.g. see [Schick1994]. Recent efforts have focused on building smoothers that are robust to outliers in the data [AravkinL12011, AravkinIFAC, Farahmand2011], using convex loss functions such as , Huber or Vapnik, in place of the penalty [Hastie01].

There have also been recent efforts to design smoothers able to better track fast system dynamics, e.g. jumps in the state values. A contribution can be found in [Ohlsson2011] where the Laplace distribution, rather than the Gaussian, is used to model transition noise. This introduces an penalty on the state evolution in time, resulting in an estimator interpretable as a dynamic version of the well known LASSO procedure [Lasso1996].

For known dynamics, all of the smoothers mentioned above can be derived by modeling the process and the measurement noise using log-concave densities, taking the form


Formulations exploiting (LABEL:logconcave) are nearly ubiquitous, in part because they correspond to convex optimization problems in the linear case. However, in order to model a regime with large outliers or sudden jumps in the state, we want to look beyond (LABEL:logconcave) and allow heavy-tailed densities, i.e. distributions whose tails are not exponentially bounded. All such distributions necessarily have non-convex loss functions [AravkinFHV:2012, Theorem 2.1].

Several interesting candidates are possible, but in this contribution we focus on the Student’s t-distribution for its convenient properties in the context of the applications we consider. The Student’s t-distribution was successfully applied to a variety of robust inference applications in [Lange1989], and is closely related to re-descending influence functions [Hampel].

In this work, we propose a smoothing framework for several applications, including robust and trend smoothing. The T-Robust smoother is derived from a dynamic system with output noise modeled by the Student’s t-distribution. This is a further robustification of the estimator proposed in [AravkinL12011], which uses the Laplace density. The re-descending influence function of the Student’s t guarantees that outliers in the measurements have less of an effect on the smoothed estimate than any convex loss function. In practice, the T-Robust smoother performs better than [AravkinL12011] for cases with a high proportion of outliers. The T-Trend smoother is similarly derived starting from a dynamic system with transition noise modeled by the Student’s t-distribution. This allows T-Trend to better track sudden changes in the state. One may consider using both aspects simultaneously; in addition, practitioners need the ability to distinguish between different measurements based on prior information of measurement fidelity, and between different states based on prior knowledge of trend stability.

In the context of Kalman filtering/smoothing, the idea of using Student’s t-distributions to model the system noise for robust and tracking applications was first proposed in [Fahr1998]. However, our work differs from that approach in some important aspects. First, our analysis includes nonlinear measurement and process models. Second, we provide a novel approach to overcome the non-convexity of the Student’s t-loss function. Third, the approach we propose can be used to solve any smoothing problem that uses Student’s t modeling for any process or measurement components.

The basic approach differs significantly from the one proposed in [Fahr1998]. [Fahr1998] proposes using the random information matrix (i.e. full Hessian) when possible, or its expectation (Fisher information) when the Hessian is indefinite. Instead, we propose a modified Gauss-Newton method which builds information about the curvature of the Student’s t-log likelihood into the Hessian approximation, and is guaranteed to be positive definite. As we show in Section LABEL:Algorithm, the new approach is provably convergent, and unlike the approach in [Fahr1998] uses information about the relative sizes of the residuals in computing descent directions. These differences make it more stable than methods using random information, and more efficient than methods using Fisher information.
The major computational tradeoff in using non-convex penalties is that the loss function in the convex case is used directly [AravkinL12011], i.e. is not approximated, whereas in the nonconvex case, the loss function must be iteratively approximated with a local convex approximation. This requires a fundamental extension of the convergence analysis.

A conference proceeding previewing this paper appears in [SYSID2012tks]. In the current work, we present a general smoothing framework that includes the two smoothers presented in [SYSID2012tks] as special cases, together with a generalized convergence theory that covers the entire range of smoothers under discussion. We also provide an open-source implementation of the general algorithm [ckbs], with a simple interface that enables the user to customize which residual or innovation components to model using the Student’s t penalty. Using this implementation, we present an expanded experimental section, and new experiments that show how robust and trend smoothing can be done simultaneously. Finally, we apply the smoothers to real data.

The paper is organized as follows. In Section LABEL:StudentT-KF, we introduce the multivariate Student’s t-distribution, review its advantages for error modeling over log-concave distributions, and introduce the dynamic model class of interest for Kalman smoothing. In Section LABEL:GenT, we describe a statistical modeling framework, where we can use Student’s t to model any process or measurement residual components. We describe all objectives that can arise this way, and provide a comprehensive method for obtaining approximate second order information for these objectives. In Section LABEL:SpecialCases, we provide details for three important special smoothers: T-Robust (robust against large measurement noise), T-Trend (able to follow sharp changes in the state), and the Double-T smoother (incorporates both aspects). In Section LABEL:Algorithm, we present the algorithm and a convergence theory for the entire framework, which also extends the convergence theory developed in [AravkinL12011]. In Section LABEL:SimulationRobust, we present numerical experiments that illustrate the behavior of all three special smoothers, include illustrations of linear and nonlinear models, and results for real and simulated data. We end the paper with concluding remarks.

2 Error Modeling with Student’s t

Figure 2.1: Gaussian, Laplace, and Student’s t Densities, Corresponding Negative Log Likelihoods, and Influence Functions.

For a vector and any positive definite matrix , let . We use the following generalization of the Student’s t-distribution:


where is the mean, is the degrees of freedom, is the dimension of the vector , and is a positive definite matrix. A comparison of this distribution with the Gaussian and Laplacian distribution appears in Figure LABEL:GLT-KF. Note that the Student’s t-distribution has much heavier tails than the others, and that its influence function is re-descending, see [Mar] for a discussion of influence functions. This means that as we pull a measurement further and further away, its ‘influence’ decreases to 0, so it is eventually ignored by the model. Note also that the -Laplace is peaked at 0, while the Student’s t-distribution is not, and so a Student’s t-fit will not in general drive residuals to be exactly .

Before we proceed with the Kalman smoothing application, we review a result from [AravkinFHV:2012], illustrating the fundamental modeling advantages of heavy tailed distributions:

Theorem 2.1

Consider any scalar density arising from a symmetric convex coercive and differentiable penalty via , and take any point with . Then for all , the conditional tail distribution induced by satisfies


When is large, the condition indicates that we are looking at an outlier. However, as shown by the theorem, any log-concave statistical model treats the outlier conservatively, dismissing the chance that could be significantly bigger than . Contrast this behavior with that of the Student’s t-distribution. When , the Student’s t-distribution is simply the Cauchy distribution, with a density proportional to . Then we have that

Heavy tailed distributions thus provide a fundamental advantage in cases where outliers may be particularly large, or, in the second application we discuss, very sudden trend changes may be present.

We now turn to the Kalman smoothing framework. We use the following general model for the underlying dynamics: for


with initial condition , with a known constant, and where are known smooth process functions, and are known smooth measurement functions. Moreover, and are mutually independent, and with known covariance matrices and , respectively. Note that here we assume all the measurement vectors have consistent dimension . There is no loss of generality compared to the standard model where the dimensions depend on , since any measurement vector can be augmented to a standard size , and then the phantom measurements can be disabled using the modeling interface (by setting corresponding columns and rows of to .)

We now briefly explain how to use Student’s t error modeling to design smoothers with two important characteristics. In order to obtain smoothers that are robust to heavily contaminated data, the vector can be modeled zero-mean Student’s t measurement noise (LABEL:StudentDensity) of known covariance and degrees of freedom . To design smoothers that can track sudden changes in the state, the process residuals are modeled using Student’s t noise. These features may be employed separately or in tandem, and we always assume that the vectors are all mutually independent.

In the next section, we design a smoother that finds the MAP estimates of for a general formulation, where Student’s t or least squares modeling can be used for any or all process and measurement residuals. We then specialize it to recover the applications discussed above.

3 Generalized Smoothing Framework

Given a sequence of column vectors and matrices we use the notation

We also make the following definitions:

In the most general case, we suppose that any of the components or components can be modeled either using Gaussian or Student’s t distributions.

For the sake of modeling clarity, assume that subcomponents of measurement and innovation residuals are consistently modeled across time points ; this gives the user the ability to select which subvectors of process and measurement residuals to model using Student’s t, but not to assign different penalties to different time points.

Denote by and the subvectors of the innovation residuals , and denote by and the subvectors of the measurement residuals that are to be modeled using the Gaussian and Student’s t distributions, respectively. Assume that all of these subvectors are mutually independent, and denote the corresponding covariance submatrices by , , , and . Maximizing the likelihood for this model is equivalent to minimizing the associated negative log likelihood

which can be explicitly written as follows:


where and are degree of freedom parameters corresponding to and .

A first-order accurate affine approximation to our model with respect to direction near a fixed state sequence is given by

Set and (where is the identity matrix) so that the formulas are also valid for .

We minimize the nonlinear nonconvex objective in (LABEL:fullObjectiveGeneral) by iteratively solving quadratic programming (QP) subproblems of the form:


where is the gradient of objective (LABEL:fullObjectiveRobust) with respect to and has the form


Note that this matrix is symmetric block tridiagonal. This structure is essential to the computational results for a wide variety of Kalman filtering and smoothing algorithms; it was noted early on in [Wright1990, Fahr1991].

In order to fully describe and , first let , denote the indices associated to all subvectors and within . For example, if the Student’s t density is used for all measurement residuals, and the Gaussian penalty is used for all process residuals, then , .

Now define with as follows:

The entries of and not explicitly defined in (LABEL:Ak) and (LABEL:Ck) are set to .

The Hessian approximation terms in (LABEL:Hk) are motivated in Section LABEL:Algorithm, and are crucial to both practical performance and theoretical convergence analysis. The solutions to the subproblem (LABEL:QuadraticSubproblemOne) have the form , and can be found in an efficient and numerically stable manner in steps, since is tridiagonal and positive definite (see [Bell2008]).

4 Special cases

We know show how the general framework of the previous section can be specialized to obtain three smoothers. The first two are T-Robust and T-Trend, which are presented in [SYSID2012tks]. The third is a new smoother where all residuals and innovations are modeled using Student’s t.

The objective corresponding to T-Robust is obtained from (LABEL:fullObjectiveGeneral) by taking , , , :


The terms in (LABEL:Ak)—(LABEL:Hk) become


The objective corresponding to T-Trend is obtained from (LABEL:fullObjectiveGeneral) by taking , , , :


The terms in (LABEL:Ak)—(LABEL:Hk) become


Finally, we can apply Student’s t to all process and measurement residuals by taking , , , to obtain


The terms in (LABEL:Ak)—(LABEL:Hk) become


5 Algorithm and Global Convergence

When models and are linear, we can compare the algorithmic scheme proposed in the previous sections with the method in [Fahr1998]. The latter uses the random information matrix (random Hessian) in place of the matrix defined above, and recommends using the expected (Fisher) information when the full Hessian is indefinite. When the densities for and are Gaussian, this is equivalent to using Newton’s method when possible, and using Gauss-Newton when the Hessian is indefinite. In general, using the expected information is known as the method of Fisher’s scoring. In the Student’s t-case, the scalar Fisher information matrix is computed in [Lange1989] to be


where is the variance and is the degrees of freedom. The authors of [Fahr1998] proposed using (LABEL:ExpectedFisher) as the Hessian approximation when the full Hessian is indefinite. Implementing this approach would effectively replace the terms or , present in the denominators of and (see LABEL:TrobH and LABEL:TrendH), with terms that depend only on and , the degrees of freedom. So while the random information (Hessian) matrix can become indefinite, the Fisher information is insensitive to outliers, and fails to down-weigh their contributions to the Hessian approximation.

To overcome these drawbacks, and find a middle ground between using the full Hessian and using a very rough approximation, we propose a Gauss-Newton method that is able to incorporate the relative size information of the residuals into the Hessian approximation. In the rest of this section we provide the details for the application of this method and a proof of convergence.

As in [AravkinL12011], the convergence theory is based upon the versatile convex-composite techniques developed in [Burke85]. We begin by choosing the convex-composite structure for objective (LABEL:fullObjectiveGeneral). We write it in the convex-composite form , with smooth and convex :


Note that the range of is , and is coercive on its domain. The terms indexed with superscript in (LABEL:Ck) and (LABEL:Hk) combine to form a positive definite approximation to the Hessian of . To see this, consider the scalar function

The second derivative of this function in is given by


and is only positive on . There are two reasonable globally positive approximations to take. The first,

simply ignores the subtracted term . In practice, we found this approximation to be too aggressive. Instead, we drop the from the left of (LABEL:LogHessian) to obtain the approximation


Similarly, the terms indexed by superscript in (LABEL:Ck) and (LABEL:Hk) provide globally positive definite approximations to the Hessian of , using the strategy in (LABEL:LogHessianApprox). This strategy offers a significant computational advantage—the Hessian approximation that is built up down-weights the contributions of outliers, helping the algorithm proceed faster to the solution. As we shall see, these terms are also essential for the general convergence theory.

Our approach exploits the objective structure by iteratively linearizing about the iterates and solving the direction finding subproblem


where is a symmetric positive semidefinite matrix that depends continuously on . For any smoother in the framework of section LABEL:GenT, problem (LABEL:DirectionFindingSubproblem) can be solved with a single block-tridiagonal solve of the system (LABEL:QuadraticSubproblemOne), yielding descent directions for the objective .

We now develop a general convergence theory for convex-composite methods to establish the overall convergence to a stationary point of . This theory is in the spirit of [AravkinL12011] and [Burke85], and allows the inclusion of the quadratic term in (LABEL:DirectionFindingSubproblem). This term was not necessary in [AravkinL12011], but is crucial here. Note that the theory does not rely at all on the technique used to solve the direction finding subproblem, and so the theory in this paper applies to the algorithm in [AravkinL12011] by taking .

Recall from [Burke85] that the first-order necessary condition for optimality in the convex composite problem involving is

where is the generalized subdifferential of at [RTRW] and is the convex subdifferential of at [Rock70]. Elementary convex analysis gives us the equivalence

For the general smoothing class of interest, it is desirable to modify this objective by including curvature information, yielding the problem (LABEL:DirectionFindingSubproblem). We define the difference function


where is positive semidefinite and varies continuously with . Note that is a convex function of that is bounded below, hence the optimal value


is well defined regardless of the existence of a solution. If , then . Hence, by [Burke85, Theorem 3.6], if and only if .

Given , we define a set of search directions at by


Note that if there is a such that , then . These ideas motivate the following algorithm.

Algorithm 5.1

Gauss-Newton Algorithm.

The inputs to this algorithm are

  • : initial estimate of state sequence

  • : overall termination criterion

  • : search direction selection parameter

  • : step size selection parameter

  • : line search step size factor

The steps are as follows:

  1. Set the iteration counter .

  2. (Gauss-Newton Step) Find in the set in LABEL:extendedDSet. Set in LABEL:extendedDelta and Terminate if .

  3. (Line Search) Set

  4. (Iterate) Set and return to Step LABEL:GaussNewtonStep.

We now present a general global convergence theorem that covers any smoother in section LABEL:GenT. This theorem also generalizes [AravkinL12011, Theorem 5.1] to include positive semidefinite curvature terms in the Gauss-Newton framework.

Theorem 5.1



and suppose that there exists a such that is bounded and uniformly continuous on the set


If is a sequence generated by the Gauss-Newton Algorithm LABEL:GaussNewtonAlgorithm with initial point and , then one of the following must occur:

  1. The algorithm terminates finitely at a point with .

  2. The sequence diverges to .

  3. for every subsequence for which the set is bounded.

Moreover, if is any cluster point of a subsequence such that the subsequence is bounded, then .

Proof. We will assume that none of (i), (ii), (iii) occur and establish a contradiction. Then there is a subsequence such that

Since is a decreasing sequence that is bounded below by , we know that the differences . Therefore, by Step 3) of Algorithm LABEL:GaussNewtonAlgorithm, , which implies that . Without loss of generality we may assume that and for all . Hence for all ,

where is a bound on over . Let be a Lipschitz constant for over the compact set . Again by Step 3) of Algorithm LABEL:GaussNewtonAlgorithm, for all ,

where is the modulus of continuity of on . Rearranging, we obtain

Taking the limit for , we obtain the contradiction . Hence, , which implies that , since .

Finally, suppose that is a cluster point of a sequence for which is bounded. Without loss of generality, there exists a such that . For all ,

where . Taking the limit over gives

where . Since was chosen arbitrarily, it must be the case that , which implies that by [Burke85, Theorem 3.6].     

A stronger convergence result is possible under stronger assumptions on and .

Corollary 5.2

Suppose that is bounded, and there exists such that


If is a sequence generated by Algorithm LABEL:GaussNewtonAlgorithm with initial point and , then and are bounded and either the algorithm terminates finitely at a point with , or as , and every cluster point of the sequence satisfies .

Proof. First note that is closed since is continuous, and therefore is compact, since by assumption it is bounded. Hence (see (LABEL:snot)) is also compact. Therefore, is uniformly continuous and bounded on which implies that the hypotheses of Theorem LABEL:GlobalConvergenceTheorem are satisfied, and so one of (i)-(iii) must hold. If (i) holds we are done, so we will assume that the sequence is infinite. Since , this sequence is bounded. We now show that the sequence of search directions is also bounded.

Suppose that (LABEL:Assumption) holds. For any direction , note that satisfies


Since , we have




Suppose that the is unbounded. Then without loss of generality, there exists a subsequence , a unit vector , and a vector such that and . Since is bounded, (LABEL:linearization) implies that , so , and therefore

On the other hand, by (LABEL:bdDir), and so in the limit we have the contradiction

Hence are bounded. The result now follows from Theorem LABEL:GlobalConvergenceTheorem.


We now show that all smoothers of section LABEL:GenT satisfy the required assumptions of Theorem LABEL:GlobalConvergenceTheorem and Corollary (LABEL:GlobalConvergenceCorollary).

Corollary 5.3 (Smoother Satisfaction)

Suppose that the process and measurement functions and in (LABEL:NonlinearStudentModel) are twice continuously differentiable. Then for given in (LABEL:FDef), is bounded and uniformly continuous on in (LABEL:snot). Moreover, the hypotheses of Corollary LABEL:GlobalConvergenceCorollary hold if for all in and for all , there exists such that

Proof. We first show that both and are bounded. The first claim follows immediately by the coercivity of in (LABEL:rhoDef). To verify the second claim, we will show that for any sequence of with , we can find a subsequence such that , which implies the existence of subsequence such that either or . In particular there does not exist an unbounded sequence with , and therefore must be bounded.

If , we can find an index and subsequence such that . Now, either and we are done, or , so . Iterating this argument, we arrive at the limiting cas