
# Robustification of Elliott’s on-line EM algorithm for HMMs

Christina Erlwein and Peter Ruckdeschel

Fraunhofer ITWM, Department of Financial Mathematics, Fraunhofer-Platz 1, D-67663 Kaiserslautern
Christina.Erlwein@itwm.fraunhofer.de, Peter.Ruckdeschel@itwm.fraunhofer.de
July 5, 2019
###### Abstract

In this paper, we establish a robustification of an on-line algorithm for modelling asset prices within a hidden Markov model (HMM). In this HMM framework, the model parameters are guided by a Markov chain in discrete time; the parameters of the asset returns are therefore able to switch between different regimes. The parameters are estimated through an on-line algorithm which utilises incoming information from the market and leads to adaptive optimal estimates. We robustify this algorithm step by step against additive outliers appearing in the observed asset prices, with the rationale to better handle possible peaks or missing values in asset returns.
Keywords: Robustness, HMM, Additive outlier, Asset pricing

## 1 Introduction

Realistic modelling of financial time series from various markets (stocks, commodities, interest rates etc.) has in recent years often been achieved through hidden Markov or regime-switching models. One major advantage of regime-switching models is their flexibility to capture switching market conditions or switching behavioural aspects of market participants, resulting in a switch in the volatility or mean value.

Regime-switching models were first applied to issues in financial markets by Hamilton (1989), who established a Markov-switching AR model for the GNP of the U.S. His results show promising effects of including possible regime switches in the characterisation of a financial time series. Many further approaches using regime-switching models for financial time series followed, e.g. switching ARCH or switching GARCH models (see for example Cai (1994) and Gray (1996)), amongst many other applications.

Various algorithms and methods for statistical inference are applied within these model set-ups, including such famous ones as the Baum-Welch algorithm and Viterbi’s algorithm for estimating the optimal state sequence. HMMs in finance, both in continuous and in discrete time, often utilise a filtering technique which was developed by Elliott (1994). Adaptive filters are derived for processes of the Markov chain (jump process, occupation time process and auxiliary processes), which are in turn used for recursive optimal estimates of the model parameters. This filter-based Expectation-Maximization (EM) algorithm leads to an on-line estimation of model parameters. Our model set-up is based on Elliott’s filtering framework.

This HMM can be applied to questions which arise in asset allocation problems. An investor typically has to decide how much of his wealth shall be invested into which asset or asset class and when to optimally restructure a portfolio. Asset allocation problems were examined in a regime-switching setting by Ang and Bekaert (2002), where high-volatility and high-correlation regimes of asset returns were discovered. Guidolin and Timmermann (2007) presented an asset allocation problem within a regime-switching model and found four different possible underlying market regimes. A paper by Sass and Haussmann (2004) derives optimal trading strategies and filtering techniques in a continuous-time regime-switching model set-up. Optimal portfolio choices were also discussed in Elliott and van der Hoek (1997) and Elliott and Hinz (2003), amongst others. Here, Markowitz’s famous mean-variance approach (see Markowitz (1952)) is transferred into an HMM and optimal weights are derived. A similar Markowitz-based approach within an HMM was developed in Erlwein et al. (2011), where optimal trading strategies for portfolio decisions with two asset classes are derived. Trading strategies are developed therein to find optimal portfolio decisions for an investment in either growth or value stocks. Elliott’s filtering technique is utilised to predict asset returns.

However, most of the optimal parameter estimation techniques for HMMs in the literature only lead to reasonable results when the market data set does not contain significant outliers. The handling of outliers is an important issue in many financial models: market data might be unreliable at times, and high peaks in asset returns, which might occur in the market from time to time, shall be considered separately and shall not negatively influence the parameter estimation method. In general, higher returns in financial time series might belong to a separate regime within an HMM; this flexibility is already included in the model set-up. However, single outliers which are not typical for any of the regimes considered shall be handled with care, since a separate regime would not reflect the abnormal data point. In this paper, we develop a robustification of Elliott’s filter-based EM-algorithm. In section 2 we set up the HMM framework, which is applied (either in a one- or multi-dimensional setting) to model asset or index returns. The general filtering technique is described in section 3. The asset allocation problem, which clarifies the effect outliers can have on the stability of the filters, is developed in section 4. Section 5 then states the derivation of a robustification for various steps in the filter equations; the robustification of a reference probability measure is derived as well as a robust version of the filter-based EM-algorithm. An application of the robust filters is shown in section 6 and section 7 finishes our work with some conclusions and possible future applications.

## 2 Hidden Markov model framework for asset returns

For our problem setting we first review a filtering approach for a hidden Markov model in discrete time which was developed by Elliott (1994). The logarithmic returns of a stock or an index follow the dynamics of the observation process which can be interpreted as a discretized version of the Geometric Brownian motion, which is a standard process to model stock returns. The underlying hidden Markov chain cannot be directly observed. The parameters of the observation process are governed by the Markov chain and are therefore able to switch between regimes over time.

We work under a probability space $(\Omega,\mathcal{F},P)$ under which $\{x_k\}$ is a homogeneous Markov chain with finite state space in discrete time $k\in\mathbb{N}_0$. Let the state space of $x_k$ be associated with the canonical basis $\{e_1,\dots,e_N\}$ of $\mathbb{R}^N$, with $e_i=(0,\dots,0,1,0,\dots,0)'$. The initial distribution of $x_0$ is known, and $\Pi=(\pi_{ji})$ is the transition probability matrix with $\pi_{ji}=P(x_{k+1}=e_j\mid x_k=e_i)$. Let $\mathcal{F}^x_k$ be the $\sigma$-field generated by $x_0,\dots,x_k$ and let $\{\mathcal{F}^x_k\}$ be the complete filtration generated by $\{x_k\}$. Under the real-world probability measure $P$ the Markov chain has the dynamics

$$x_{k+1}=\Pi x_k+v_{k+1} \tag{2.1}$$

where $\{v_{k+1}\}$ is a sequence of martingale increments (see Elliott (1994)).
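To make the state dynamics concrete, the following sketch (hypothetical Python, not part of the paper) simulates a chain in unit-vector coding and checks empirically that $v_{k+1}=x_{k+1}-\Pi x_k$ behaves like a martingale increment; the column-stochastic convention for $\Pi$ and all numbers are illustrative assumptions.

```python
import numpy as np

def simulate_chain(Pi, x0, T, rng):
    """Simulate a finite-state Markov chain in unit-vector coding;
    Pi[j, i] = P(x_{k+1} = e_j | x_k = e_i), i.e. columns sum to one."""
    N = Pi.shape[0]
    states = np.zeros((T + 1, N))
    states[0] = x0
    for k in range(T):
        i = int(np.argmax(states[k]))        # current state index
        j = rng.choice(N, p=Pi[:, i])        # draw next state from column i
        states[k + 1, j] = 1.0
    return states

rng = np.random.default_rng(0)
Pi = np.array([[0.9, 0.2],
               [0.1, 0.8]])                  # column-stochastic, made up
x = simulate_chain(Pi, np.array([1.0, 0.0]), 500, rng)
# v_{k+1} = x_{k+1} - Pi x_k should average out near zero (martingale increments)
v = x[1:] - x[:-1] @ Pi.T
```

Since $E[x_{k+1}\mid\mathcal{F}^x_k]=\Pi x_k$, the empirical mean of $v$ shrinks with the sample length.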

The Markov chain is “hidden” in the log returns of the stock price $S_k$. Our observation process $\{y_k\}$ is given by

$$y_{k+1}=\ln\frac{S_{k+1}}{S_k}=f(x_k)+\sigma(x_k)\,w_{k+1} \tag{2.2}$$

where $f(x_k)$ and $\sigma(x_k)$ take finitely many values and the $w_k$’s constitute a sequence of i.i.d. standard normal random variables independent of $\{x_k\}$. The real-valued process $\{y_k\}$ can be re-written as

$$y_{k+1}=\langle f,x_k\rangle+\langle\sigma,x_k\rangle\,w_{k+1}. \tag{2.3}$$

Note that $f=(f_1,\dots,f_N)'$ and $\sigma=(\sigma_1,\dots,\sigma_N)'$ are vectors; furthermore $f(x_k)=\langle f,x_k\rangle$ and $\sigma(x_k)=\langle\sigma,x_k\rangle$, where $\langle\cdot,\cdot\rangle$ denotes the Euclidean scalar product in $\mathbb{R}^N$. We assume $\sigma_i>0$. Let $\mathcal{F}^y_k$ be the filtration generated by the $y_1,\dots,y_k$, and $\mathcal{F}_k=\mathcal{F}^x_k\vee\mathcal{F}^y_k$ is the global filtration.
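As a quick illustration, the observation process (2.3) can be simulated on top of the chain; this is a hypothetical sketch with made-up regime parameters $f$, $\sigma$ and $\Pi$ (column-stochastic convention assumed).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2
f = np.array([0.01, -0.02])        # hypothetical regime means
sigma = np.array([0.02, 0.06])     # hypothetical regime volatilities
Pi = np.array([[0.95, 0.10],
               [0.05, 0.90]])      # column-stochastic transition matrix

# simulate the chain by state index, then y_{k+1} = <f,x_k> + <sigma,x_k> w_{k+1}
T = 1000
idx = np.zeros(T + 1, dtype=int)
for k in range(T):
    idx[k + 1] = rng.choice(N, p=Pi[:, idx[k]])
w = rng.standard_normal(T)
y = f[idx[:-1]] + sigma[idx[:-1]] * w

# recover prices from log returns: S_{k+1} = S_k * exp(y_{k+1})
S = 100.0 * np.exp(np.concatenate([[0.0], np.cumsum(y)]))
```

By construction, $\ln(S_{k+1}/S_k)$ recovers the simulated return series exactly.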

Equation (2.1) above (cf. Elliott (1994)) shows that the dynamics of the underlying Markov chain can be described by martingale differences.

## 3 Essential Steps in Elliott’s Algorithm

### 3.1 Change of Measure

A widely used concept in filtering applications, going back to Zakai (1969) for stochastic filtering, is the change of probability measure technique. A measure change to a reference measure $\bar P$ is applied here, under which filters for the Markov chain and related processes are derived. Under $\bar P$ the underlying Markov chain still has the dynamics (2.1), but is independent of the observation process, and the observations $y_k$ are i.i.d. standard normal random variables.
Following the change of measure technique which was outlined in Elliott et al. (1995), the adaptive filters for the Markov chain and related processes are derived under this “idealised” measure $\bar P$. Changing back to the real-world measure $P$ is done by constructing $P$ from $\bar P$ through a Radon–Nikodým derivative. To construct it, we define the process

$$\lambda_l:=\frac{\phi\big(\sigma(x_{l-1})^{-1}(y_l-f(x_{l-1}))\big)}{\sigma(x_{l-1})\,\phi(y_l)} \tag{3.1}$$

where $\phi$ is the probability density function of a standard normal random variable, and set $\Lambda_k=\prod_{l=1}^k\lambda_l$. Under $\bar P$ the sequence $\{y_k\}$ is a sequence of i.i.d. standard normals.
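A single factor of (3.1) is easy to compute directly; the following sketch (hypothetical Python, not the paper's code) evaluates one $\lambda_l$ and the running product $\Lambda_k$.

```python
import numpy as np

def phi(z):
    """Standard normal density."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def lam(y_l, f_prev, sigma_prev):
    """One factor lambda_l of (3.1); f_prev = f(x_{l-1}), sigma_prev = sigma(x_{l-1})."""
    return phi((y_l - f_prev) / sigma_prev) / (sigma_prev * phi(y_l))

# Lambda_k is the running product of the lambda_l (illustrative numbers)
y = np.array([0.02, -0.01, 0.03])
Lambda_k = np.prod(lam(y, f_prev=0.01, sigma_prev=0.05))
```

In the degenerate case $f(x_{l-1})=0$, $\sigma(x_{l-1})=1$ the factor equals one, since numerator and denominator coincide.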

### 3.2 Filtering for general adapted processes

The general filtering techniques and the filter equations which were established by Elliott (1994) for Markov chains observed in Gaussian noise are stated in this subsection. This filter-based EM-algorithm is adaptive, which enables fast calculations and filter updates. Our robustification partly keeps this adaptive structure of the algorithm, although the recursivity cannot be kept completely.
In general, filters for four types of processes related to the Markov chain, namely the state space process, the jump process, the occupation time process and auxiliary processes including terms of the observation process are derived. Information on these processes can be filtered out from our observation process and can in turn be used to find optimal parameter estimates.

To determine the expectation of an adapted stochastic process $H_k$ given the observation filtration $\mathcal{F}^y_k$, consider the reference probability measure $\bar P$ defined through the Radon–Nikodým derivative $\bar\Lambda_k$. From Bayes’ theorem, a filter for any adapted process $H_k$ is given by $E[H_k\mid\mathcal{F}^y_k]=\bar E[\bar\Lambda_k H_k\mid\mathcal{F}^y_k]/\bar E[\bar\Lambda_k\mid\mathcal{F}^y_k]$. We define $\eta_k(H_k):=\bar E[\bar\Lambda_k H_k\mid\mathcal{F}^y_k]$, so that $E[H_k\mid\mathcal{F}^y_k]=\eta_k(H_k)/\eta_k(1)$. A recursive relationship between $\eta_k(H_k)$ and $\eta_{k-1}(H_{k-1})$ has to be found; however, a recursive formula is found for the term $\eta_k(H_k x_k)$ instead. To relate $\eta_k(H_k)$ and $\eta_k(H_k x_k)$ we note that, with $\mathbf{1}=(1,\dots,1)'$ and $\langle\mathbf{1},x_k\rangle=1$,

$$\langle\mathbf{1},\eta_k(H_k x_k)\rangle=\eta_k(H_k\langle\mathbf{1},x_k\rangle)=\eta_k(H_k). \tag{3.2}$$

Therefore

$$E[H_k\mid\mathcal{F}^y_k]=\frac{\langle\mathbf{1},\eta_k(H_k x_k)\rangle}{\langle\mathbf{1},\eta_k(x_k)\rangle}. \tag{3.3}$$

A general recursive filter for adapted processes was derived by Elliott (1994). Suppose $\{H_k\}$ is a scalar adapted process, $H_0$ is $\mathcal{F}_0$-measurable and $H_k=H_{k-1}+a_k+\langle b_k,v_k\rangle+g_k f(y_k)$, where $\{a_k\}$, $\{b_k\}$ and $\{g_k\}$ are $\mathcal{F}$-predictable, $f$ is a scalar-valued function and $\{v_k\}$ are the martingale increments from (2.1). A recursive relation for $\eta_k(H_k x_k)$ is given by

$$\begin{aligned}\eta_k(H_k x_k)=\sum_{i=1}^N\Gamma^i(y_k)\Big[&\langle e_i,\eta_{k-1}(H_{k-1}x_{k-1})\rangle\,\Pi e_i\\ &+\langle e_i,\eta_{k-1}(a_k x_{k-1})\rangle\,\Pi e_i\\ &+\big(\mathrm{diag}(\Pi e_i)-(\Pi e_i)(\Pi e_i)'\big)\,\eta_{k-1}(b_k\langle e_i,x_{k-1}\rangle)\\ &+\eta_{k-1}(g_k\langle e_i,x_{k-1}\rangle)\,f(y_k)\,\Pi e_i\Big]\end{aligned} \tag{3.4}$$

Here, for any column vectors $u$ and $v$, $uv'$ denotes the rank-one (if $u\neq 0$ and $v\neq 0$) matrix with entries $u_iv_j$. The term $\Gamma^i(y_k)$ denotes the component-wise Radon–Nikodým derivative:

$$\Gamma^i(y_k)=\frac{\phi\left(\frac{y_k-f_i}{\sigma_i}\right)}{\sigma_i\,\phi(y_k)}$$

Now, filters for the state of the Markov chain as well as for three related processes (the jump process, the occupation time process and auxiliary processes of the Markov chain) are derived. These processes can be characterised as special cases of the general process $H_k$.

The estimator for the state $x_k$ is derived by setting $H_k\equiv 1$, i.e. $a_k=0$, $b_k=0$ and $g_k=0$ in equation (3.4). This implies that

$$\eta_k(x_k)=\sum_{i=1}^N\Gamma^i(y_k)\,\langle e_i,\eta_{k-1}(x_{k-1})\rangle\,\Pi e_i. \tag{3.5}$$
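The state recursion can be sketched in a few lines (hypothetical Python, assuming the column-stochastic convention for $\Pi$; the per-step renormalisation is only a numerical convenience, as the factor cancels in (3.3)):

```python
import numpy as np

def phi(z):
    """Standard normal density."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def state_filter(y_series, Pi, f, sigma, eta0):
    """Recursion (3.5): eta_k = Pi diag(Gamma(y_k)) eta_{k-1}.
    Normalising by the sum gives the filtered state probabilities."""
    eta = eta0.astype(float)
    for y in y_series:
        g = phi((y - f) / sigma) / (sigma * phi(y))   # Gamma^i(y_k)
        eta = Pi @ (g * eta)
        eta /= eta.sum()                              # numerical stabilisation
    return eta
```

With two well-separated regimes, the filter concentrates mass on the regime consistent with the observed returns.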

The first related process is the number of jumps $J^{sr}_k$ of the Markov chain from state $e_r$ to state $e_s$ up to time $k$. Setting $H_k=J^{sr}_k$ and choosing $a_k$, $b_k$ and $g_k$ accordingly in equation (3.4), we get

$$\eta_k(J^{sr}_k x_k)=\sum_{i=1}^N\Gamma^i(y_k)\,\langle\eta_{k-1}(J^{sr}_{k-1}x_{k-1}),e_i\rangle\,\Pi e_i+\Gamma^r(y_k)\,\eta_{k-1}(\langle x_{k-1},e_r\rangle)\,\pi_{sr}e_s. \tag{3.6}$$

The second process is the occupation time $O^r_k$ of the Markov chain, which is the length of time spent in state $e_r$ up to time $k$. Here, $O^r_k=\sum_{l=1}^k\langle x_{l-1},e_r\rangle$. We set $H_k=O^r_k$, $a_k=\langle x_{k-1},e_r\rangle$, $b_k=0$ and $g_k=0$ in equation (3.4) to obtain

$$\eta_k(O^r_k x_k)=\sum_{i=1}^N\Gamma^i(y_k)\,\langle\eta_{k-1}(O^r_{k-1}x_{k-1}),e_i\rangle\,\Pi e_i+\Gamma^r(y_k)\,\langle\eta_{k-1}(x_{k-1}),e_r\rangle\,\Pi e_r. \tag{3.7}$$

Finally, consider the auxiliary processes $T^r_k(g)$, which occur in the maximum likelihood estimation of the model parameters. Specifically, $T^r_k(g)=\sum_{l=1}^k\langle x_{l-1},e_r\rangle\,g(y_l)$, where $g$ is a function of the form $g(y)=y$ or $g(y)=y^2$. We apply formula (3.4) and get

$$\eta_k(T^r_k(g)x_k)=\sum_{i=1}^N\Gamma^i(y_k)\,\langle\eta_{k-1}(T^r_{k-1}(g)x_{k-1}),e_i\rangle\,\Pi e_i+\Gamma^r(y_k)\,\langle\eta_{k-1}(x_{k-1}),e_r\rangle\,g(y_k)\,\Pi e_r. \tag{3.8}$$

The recursive optimal estimates of $x_k$, $J^{sr}_k$, $O^r_k$ and $T^r_k(g)$ can then be calculated using equation (3.3).
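To illustrate how the state recursion (3.5) and a related-process recursion run jointly, here is a sketch restricted to the occupation-time filter (3.7); it is hypothetical Python, not the paper's implementation, and the common per-step normaliser is harmless because it cancels in (3.3).

```python
import numpy as np

def phi(z):
    """Standard normal density."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def occupation_filter(y_series, Pi, f, sigma, eta0):
    """Joint recursions (3.5) and (3.7): unnormalised filters for x_k
    and for O^r_k x_k (rows of etaO index r)."""
    N = len(f)
    eta = eta0.astype(float)
    etaO = np.zeros((N, N))                    # etaO[r] = eta_k(O^r_k x_k)
    for y in y_series:
        g = phi((y - f) / sigma) / (sigma * phi(y))     # Gamma^i(y_k)
        new_etaO = np.empty_like(etaO)
        for r in range(N):
            # (3.7): propagate old filter, add mass for time spent in e_r
            new_etaO[r] = Pi @ (g * etaO[r]) + g[r] * eta[r] * Pi[:, r]
        eta = Pi @ (g * eta)                   # (3.5)
        c = eta.sum()                          # common normaliser
        eta, etaO = eta / c, new_etaO / c
    # best estimates O_hat[r] = <1, eta(O^r x)> / <1, eta(x)> per (3.3)
    return eta, etaO.sum(axis=1) / eta.sum()
```

A useful sanity check: the filtered occupation times sum to the elapsed time $k$, since $\sum_r O^r_k=k$ pathwise.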

### 3.3 Filter-based EM-algorithm

The derived adapted filters for processes of the Markov chain can now be utilised to derive optimal parameter estimates through a filter-based EM-algorithm. The set of parameters $\rho$ which determines the regime-switching model is

$$\rho=\{\pi_{ji},\,1\le i,j\le N;\ f_i,\sigma_i,\,1\le i\le N\}. \tag{3.9}$$

Initial values for the EM algorithm are assumed to be given. Starting from these values, updated parameter estimates are derived which maximise the conditional expectation of the log-likelihoods. The M-step of the algorithm deals with maximizing the following likelihoods:
M-Step

• The likelihood in the global model, where the Markov chain is observed, is given by

$$\log\Lambda_t(\sigma,f;(x_s,y_s)_{s\le t})=-\frac12\sum_{s=1}^t\left(\log\langle\sigma,x_{s-1}\rangle+\frac{(y_s-\langle f,x_{s-1}\rangle)^2}{\langle\sigma,x_{s-1}\rangle^2}\right)$$
• In the $\mathcal{F}^y_t$-model, where the Markov chain is not observed, we obtain

$$L_t(\sigma,f;(y_s)_{s\le t})=E[\log\Lambda_t(\sigma,f;(x_s,y_s)_{s\le t})\mid\mathcal{F}^y_t]=-\frac12\sum_{k=1}^N\left(\log\sigma_k\,\hat O^k_t+\frac{\hat T^k_t(y^2)-2\hat T^k_t(y)f_k+\hat O^k_t f_k^2}{\sigma_k^2}\right) \tag{3.10}$$

The maximum likelihood estimates of the model parameters can be expressed through the adapted filters. Whenever new information is available on the market, the filters are updated and, respectively, updated parameter estimates can be obtained.

###### Theorem 3.1 (Optimal parameter estimates)

Write $\hat H_k:=E[H_k\mid\mathcal{F}^y_k]$ for any adapted process $H_k$. With $\hat J^{ji}_k$, $\hat O^i_k$ and $\hat T^i_k$ denoting the best estimates for the processes $J^{ji}_k$, $O^i_k$ and $T^i_k$, respectively, the optimal parameter estimates $\hat\pi_{ji}$, $\hat f_i$ and $\hat\sigma_i$ are given by

$$\hat\pi_{ji}=\frac{\hat J^{ji}_k}{\hat O^i_k}=\frac{\eta_k(J^{ji}_k)}{\eta_k(O^i_k)} \tag{3.11}$$
$$\hat f_i=\frac{\hat T^i_k(y)}{\hat O^i_k}=\frac{\eta_k(T^i_k(y))}{\eta_k(O^i_k)} \tag{3.12}$$
$$\hat\sigma_i=\sqrt{\frac{\hat T^i_k(y^2)-2\hat f_i\hat T^i_k(y)+\hat f_i^2\hat O^i_k}{\hat O^i_k}}. \tag{3.13}$$
###### Proof.

The derivation of the optimal parameter estimates can be found in Elliott et al. (1995). ∎
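For concreteness, the estimates (3.11)–(3.13) can be written down directly in terms of the filtered quantities. The following sketch (hypothetical Python, with synthetic inputs) assumes $\hat J$, $\hat O$, $\hat T(y)$ and $\hat T(y^2)$ have already been produced by the E-step:

```python
import numpy as np

def m_step(J_hat, O_hat, T_y, T_y2):
    """Optimal parameter estimates (3.11)-(3.13) from filtered processes.

    J_hat[j, i]: filtered jump counts e_i -> e_j; O_hat[i]: occupation times;
    T_y[i], T_y2[i]: filtered sums of y and y^2 weighted by state i.
    """
    Pi_hat = J_hat / O_hat[None, :]                    # pi_ji = J^{ji}/O^i
    f_hat = T_y / O_hat                                # f_i = T^i(y)/O^i
    var = (T_y2 - 2 * f_hat * T_y + f_hat**2 * O_hat) / O_hat
    sigma_hat = np.sqrt(np.maximum(var, 0.0))          # guard tiny negatives
    return Pi_hat, f_hat, sigma_hat
```

Note that the columns of the estimated transition matrix sum to one whenever the jump counts into all states add up to the occupation time of the originating state.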

Summary. The filter-based EM-algorithm runs in batches of $n$ data points ($n$ typically ranging from a minimum of ten up to a maximum of fifty) over the given time series. The parameters are updated at the end of each batch. Elliott’s algorithm comprises the following steps:

1. (0) Find suitable starting values for $\Pi$, $f$ and $\sigma$,

2. (RN) Determine the Radon–Nikodým derivative for the measure change to $\bar P$,

3. (E) Recursively compute the filters $\eta_k(x_k)$, $\eta_k(J^{sr}_k x_k)$, $\eta_k(O^r_k x_k)$ and $\eta_k(T^r_k(g)\,x_k)$,

4. Obtain the ML-estimators $\hat\pi_{ji}$ and $\hat f_i$,

5. Obtain the ML-estimators $\hat\sigma_i$,

6. Go to (RN) to compute the next batch.

## 4 Outliers in Asset Allocation Problem

### 4.1 Outliers in General

In the following sections we derive a robustification of Algorithm 3.3 to stabilize it in the presence of outliers in the observation process. To this end, let us discuss what makes an observation an outlier. First of all, outliers are exceptional events, occurring rarely, say with probability $\varepsilon$. Rather than being captured by usual randomness, i.e., by some distributional model, they belong to what Knight (1921) refers to as uncertainty: they are uncontrollable, of unknown distribution and unpredictable; their distribution may change from observation to observation, so they are non-recurrent and do not form an additional state, and hence cannot be used to enhance predictive power; and, what makes their treatment difficult, they often cannot be told with certainty from ideal observations.

Still, the majority of the observations in a realistic sample should resemble an ideal (distributional) setting closely, otherwise the modeling would be questionable. Here we understand closeness in a distributional sense, as captured, e.g., by goodness-of-fit distances like the Kolmogorov, total variation or Hellinger distance. More precisely, ideally, this closeness should be compatible with the usual convergence mode of the Central Limit Theorem, i.e., with the weak topology. In particular, closeness in moments is incompatible with this idea.

Topologically speaking, one would most naturally use balls around a certain element, i.e., the set of all distributions with a suitable distance no larger than some given radius to the distribution assumed in the ideal model.

Conceptually, the most tractable neighborhoods are given by the so-called Gross Error Model, defining a neighborhood about a distribution $F$ as the set of all distributions $G$ given by

$$U_c(F,\varepsilon)=\{G\mid\exists H:\ G=(1-\varepsilon)F+\varepsilon H\} \tag{4.1}$$

They can also be thought of as the set of all distributions of (realistic) random variables constructed as

$$X^{\rm re}=(1-U)X^{\rm id}+UX^{\rm di} \tag{4.2}$$

where $X^{\rm id}$ is a random variable distributed according to the ideal distribution and $U\sim\mathrm{Bin}(1,\varepsilon)$ is an independent switching variable, which in most cases lets you see $X^{\rm id}$, but in some cases replaces it by some contaminating or distorting variable $X^{\rm di}$ which has nothing to do with the original situation.
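A sample from the gross error model (4.2) can be generated as follows; this is a sketch, and the 5% contamination rate and the $N(10,0.1)$ distorting law are arbitrary choices for illustration.

```python
import numpy as np

def contaminate(x_id, eps, distort, rng):
    """Draw X^re = (1-U) X^id + U X^di with U ~ Bin(1, eps), per (4.2).

    distort(rng, n) generates the contaminating sample X^di; its law is
    arbitrary and entirely up to the caller, as in the gross error model.
    """
    U = rng.random(x_id.shape[0]) < eps
    x_re = x_id.copy()
    x_re[U] = distort(rng, int(U.sum()))
    return x_re, U

rng = np.random.default_rng(7)
x_id = rng.standard_normal(10_000)     # ideal N(0,1) sample
x_re, U = contaminate(x_id, 0.05,
                      lambda r, n: r.normal(10.0, 0.1, n), rng)
```

The unswitched observations remain exactly the ideal ones, which is the defining feature of (4.2).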

### 4.2 Time-dependent Context: Exogenous and Endogenous Outliers

In our time-dependent setup, in addition to the i.i.d. situation, we have to distinguish whether the impact of an outlier is propagated to subsequent observations or not. Historically, there is a common terminology due to Fox (1972), who distinguishes innovation outliers (or IO’s) and additive outliers (or AO’s). Non-propagating AO’s are added at random to single observations, while IO’s denote gross errors affecting the innovations. For consistency with the literature, we use the same terms, but in a wider sense: IO’s stand for general endogenous outliers entering the state layer (the Markov chain in the present context), hence with propagated distortion. As the state space of our Markov chain is finite, IO’s are much less threatening here than they are in general.

Correspondingly, wide-sense AO’s denote general exogenous outliers which do not propagate; they hence also comprise substitutive outliers or SO’s as defined in a simple generalization of (4.2) to the state space context in equations (4.3)–(4.6).

$$Y^{\rm re}=(1-U)Y^{\rm id}+UY^{\rm di},\qquad U\sim\mathrm{Bin}(1,r) \tag{4.3}$$

for $U$ independent of $(X,Y^{\rm id})$ and some arbitrary distorting random variable $Y^{\rm di}$, for which we assume

$$Y^{\rm di},\ X\ \text{independent} \tag{4.4}$$

and the law of which is arbitrary, unknown and uncontrollable. As a first step, consider the set $\partial U^{\rm SO}(r)$ defined as

$$\partial U^{\rm SO}(r)=\{\mathcal{L}(X,Y^{\rm re})\mid Y^{\rm re}\ \text{acc. to (4.3) and (4.4)}\} \tag{4.5}$$

Because of condition (4.4), in the sequel we refer to the random variables $Y^{\rm id}$ and $Y^{\rm di}$ instead of their respective (marginal) distributions only, while in the common gross error model, reference to the respective distributions would suffice. Condition (4.4) also entails that in general, contrary to the gross error model, the ideal law is not an element of $\partial U^{\rm SO}(r)$, i.e., not representable itself as some $\mathcal{L}(X,Y^{\rm re})$ in this neighborhood.

As corresponding (convex) neighborhood we define

$$U^{\rm SO}(r)=\bigcup_{0\le s\le r}\partial U^{\rm SO}(s) \tag{4.6}$$

hence the symbol “$\partial$” in $\partial U^{\rm SO}(r)$, as the latter can be interpreted as the corresponding surface of this ball. Of course, $U^{\rm SO}(r)$ contains the ideal law $\mathcal{L}(X,Y^{\rm id})$. In the sequel, where clear from the context, we drop the superscript SO and the argument $r$.

Due to their different nature, as a rule, IO’s and AO’s require different policies: As AO’s are exogenous, we would like to damp their effect, while when there are IO’s, something has happened in the system, so the usual goal will be to track these changes as fast as possible.

### 4.3 Evidence for Robustness Issue in Asset Allocation

In this section we examine the robustness of the filter and parameter estimation technique. The filter technique is implemented and applied to monthly returns of the MSCI World Index between 1994 and 2009. The MSCI World Index is one of the leading indices on the stock markets and a common benchmark for global stocks. The algorithm is implemented with batches of ten data points; the adaptive filters are therefore updated whenever ten new data points are available on the market. The recursive parameter estimates, which utilise this new information, are updated as well; the algorithm is self-tuning. Care has to be taken when choosing the initial values for the algorithm, since the EM-algorithm in its general form converges to a local maximum. In this implementation we choose the initial values with regard to the mean and variance of the first ten data points. Figure 1 shows the original return series, optimal parameter estimates for the index returns as well as the one-step ahead forecast.

To highlight the sensitivity of the filter technique towards exogenous outliers, we plant unusually high returns within the time series. Considerable SO outliers are included at selected time steps. The optimal parameter estimation through the filter-based EM-algorithm on this data set with outliers can be seen in Figure 2. The filter still finds optimal parameter estimates, although the estimates are visibly affected by the outliers.

In a third step, severe outliers are planted into the observation sequence. The data points now show severe SO outliers, as can be seen from the first panel in Figure 3. The filters can no longer run through; optimal parameter estimates cannot be established in a setting with severe outliers.

In practice, asset or index return time series can certainly include outliers from time to time. This might be due to wrong prices in the system, but also due to very unlikely market turbulence for a short period of time. It has to be noted that the type of outliers which we consider in this study does not characterise an additional state of the Markov chain. In the following, we develop robust filter equations which can handle exogenous outliers.

## 5 Robust Statistics

To overcome effects like in Figure 3 we need more stable variants of the Elliott type filters discussed so far. This is what robust statistics is concerned with. Excellent monographs on this topic are e.g., Huber (1981), Hampel et al. (1986), Rieder (1994), Maronna et al. (2006). This section provides necessary concepts and results from robust statistics needed to obtain the optimally-robust estimators used in this article.

### 5.1 Concepts of Robust Statistics

The central mathematical concepts of continuity, differentiability, or closeness to singularities may in fact serve to operationalize stability quite well already. To make these available in our context, it helps to consider a statistical procedure, i.e., an estimator, a predictor, a filter, or a test, as a function of the underlying distribution. In a parametric context, this amounts to considering functionals $T$ mapping distributions to the parameter set $\Theta$. An estimator is then simply such a functional applied to the empirical distribution. For filtering or prediction, the range of such a functional will rather be the state space, but otherwise the arguments run in parallel.

For a notion of continuity, we have to specify a topology, and as in the case of outliers, we use topologies essentially compatible with the weak topology. With these neighborhoods, we now may easily translate the notions of continuity, differentiability and closest singularity to this context: (equi-)continuity is then called qualitative robustness (Hampel et al., 1986, Sec. 2.2 Def. 3); a differentiable functional with a bounded derivative is called locally robust, and its derivative is called influence function (IF) (in mathematical rigor, the IF, when it exists, is the Gâteaux derivative of the functional into the direction of a tangent; for certain properties, this notion is in fact too weak, and one has to require stronger notions like Hadamard or Fréchet differentiability; for details, see Fernholz (1983) or (Rieder, 1994, Ch. 1)); compare (Hampel et al., 1986, Sec. 2.1 Def. 1). The IF reflects the infinitesimal influence of a single observation on the estimator. Under additional assumptions, many of the asymptotic properties of an estimator are expressions in the IF. E.g., the asymptotic variance of the estimator in the ideal model is the second moment of the IF. Infinitesimally, i.e., for $\varepsilon\to 0$, the maximal bias on $U_c(F,\varepsilon)$ is just $\varepsilon\sup|\mathrm{IF}|$, where $|\cdot|$ denotes the Euclidean norm. $\sup|\mathrm{IF}|$ is then also called gross error sensitivity (GES), (Hampel et al., 1986, (2.1.13)). Seeking robust optimality hence amounts to finding optimal IFs.

To grasp the maximal bias of a functional on a neighborhood of radius $r$, one considers the max-bias curve $r\mapsto\mathrm{maxBias}(r)$. The singularity of this curve closest to $0$ (i.e., to the ideal situation of no outliers) captures the behavior under massive deviations, or global robustness. In robust statistics, this is called the breakdown point: the maximal radius the estimator can cope with without producing an arbitrarily large bias; see (Hampel et al., 1986, Sec. 2.2 Def.’s 1,2) for formal definitions.

Usually, the classically optimal estimators (the MLE in many circumstances) are non-robust, both locally and globally. Robust estimators, on the other hand, pay a certain price for their stability, as expressed by an asymptotic relative efficiency (ARE) strictly smaller than one in the ideal model, where the ARE is the ratio of the asymptotic (co)variances of the classically optimal estimator and its robust alternative.

To rank various robust procedures among themselves, other quality criteria are needed, though, summarizing the behavior of the procedure on a whole neighborhood as in (4.1). A natural candidate for such a criterion is maximal MSE (maxMSE) on some neighborhood around the ideal model; alternatively, in an estimation context, maximal bias (maxBias) on the respective neighborhood, or, referring to the famous (Hampel, 1968, Lemma 5), the trace of the ideal variance subject to a bias bound on this neighborhood. In an estimation context, the respective solutions are usually called OMSE (Optimal MSE Estimator), MBRE (Most Bias Robust Estimator), and OBRE (Optimally Bias Robust Estimator) (for the terms OBRE and MBRE, see Hampel et al. (1986), while for OMSE see Ruckdeschel and Horbenko (2010)).

In our context we encounter two different situations where we want to apply robust ideas: (recursive) filtering in the (E)-step and estimation in the (M)-step. While in the former situation we only add a single new observation, which precludes asymptotic arguments, in the (M)-step, the preceding application of Girsanov’s theorem turns our situation into an i.i.d. setup, where each observation becomes (uniformly) asymptotically negligible and asymptotics apply in a standard form.

### 5.2 Our Robustification of the HMM: General Strategy

As a robustification of the whole estimation process in this only partially observed model would (a) lead to computationally intractable terms and (b) drop the key feature of recursivity, we instead propose to robustify each of the steps in Elliott’s algorithm separately. Doing so, the whole procedure will be robust, but in general will lose robust optimality, i.e., contrary to multiparameter maximum likelihood, the Bellman principle does not hold for optimal robust multi-step procedures, simply because optimal clipping in several steps is not the same as joint optimal clipping of all steps. Table 1 lists all essential steps in Elliott’s algorithm together with our proposed robustification approach.

### 5.3 Robustification of Step (0)

So far, little has been said as to the initialization, even in the non-robustified setting. Basically, all we have to do is to make sure that the EM algorithm converges. In prior applications of this algorithm (Mamon et al. (2008) and Erlwein et al. (2011) amongst others), one approach was to fill $\Pi$ with entries $1/N$, i.e., with a uniform (and hence non-informative) distribution over all states, independent of the current state. As to $f$ and $\sigma$, an ad hoc approach would estimate the global mean and variance over all states and then, again in a non-informative way, jitter the state-individual moments, adding independent noise to them. In our preliminary experiments, it turned out that this naïve approach could drastically fail in the presence of outliers, so we instead propose a more sophisticated approach which can also be applied in a classical (i.e., non-robust) setting: in a first step we ignore the time dynamics and interpret our observations as realizations of a Gaussian mixture model, for which we use the R package mclust (Fraley and Raftery (2002); Fraley et al. (2012)) to identify the mixture components, and for each of these, we individually determine the moments $f_i$ and $\sigma_i$. As to $\Pi$, we again assume independence of the current state, but fill the columns according to the estimated frequencies of the mixture components. In the non-robust setting we would determine the moments of each mixture component by their ML estimators (assuming independent observations). For a robust approach, we use one additional mixture component, the one with the lowest frequency, as a pure noise component capturing outliers. For each non-noise component we retain the ML estimates for its moments. The noise component is then randomly distributed amongst the remaining components, respecting their relative frequencies prior to this redistribution.

We are aware of the fact that reassigning the putative outliers at random could be misleading in ideal situations (with no outliers) where one cluster could be split into two, but not necessarily so. In our strategy, the smaller offspring of such a cluster would in part be reassigned to wrong other clusters, so this could still be worked on. On the other hand, this choice often works reasonably well, and as more sophisticated strategies are questions of model selection, we defer them to further work.

Based on these moment estimates, for each observation $y_k$ and each state $e_i$, we get weights $w_{i,k}$ representing the likelihood that observation $y_k$ is in state $e_i$. For each $i$, we again determine robustified moment estimators as weighted medians and scaled weighted MADs (medians of absolute deviations).

Weighted Medians and MADs: For weights $w_i$ and observations $y_i$, the weighted median is defined as $\mathrm{med}_w(y)=\operatorname{argmin}_m\sum_i w_i|y_i-m|$, and with $r_i=|y_i-\mathrm{med}_w(y)|$, the scaled weighted MAD is defined as $c\,\mathrm{med}_w(r)$, where $c$ is a consistency factor to warrant consistent estimation of $\sigma$ in the case of Gaussian observations. $c$ can be obtained empirically for a sufficiently large sample size.
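The two estimators can be sketched as follows (hypothetical Python; we plug in the standard Gaussian consistency factor 1.4826 of the unweighted MAD as a stand-in, whereas the paper calibrates its factor empirically):

```python
import numpy as np

def weighted_median(y, w):
    """argmin_m sum_i w_i |y_i - m|: the smallest sorted observation at
    which the cumulative weight reaches half the total weight."""
    order = np.argsort(y)
    cw = np.cumsum(w[order])
    j = np.searchsorted(cw, 0.5 * cw[-1])
    return y[order][j]

def weighted_mad(y, w, c=1.4826):
    """Scaled weighted MAD: c * weighted median of absolute deviations."""
    m = weighted_median(y, w)
    return c * weighted_median(np.abs(y - m), w)
```

With equal weights, the weighted median reduces to the ordinary median and is unaffected by moving a single observation arbitrarily far out.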

As to the (finite sample) breakdown point (FSBP) of the weighted median (and at the same time of the scaled weighted MAD), define the total weight $w_\cdot=\sum_i w_i$ and order the weights such that $w_{(1)}\ge w_{(2)}\ge\dots$. The FSBP in both cases is attained once the contaminated observations carry at least half of the total weight, which (for equivariant estimators) can be shown to be the largest possible value. So using weighted medians and MADs, we achieve a decent degree of robustness against outliers: if no single observation carries excessive weight, several outliers (placed at the observations with the largest weights) are needed to produce a breakdown.

### 5.4 Robustification of the E-step

As indicated, in this step we cannot recur to asymptotics, but rather have to appeal to a theory particularly suited for this recursive setting. In particular, the SO-neighborhoods introduced in (4.3) turn out to be helpful here.

#### 5.4.1 Crucial Optimality Theorem

Consider the following optimization problem of reconstructing the ideal observation $Y^{\rm id}$ by means of the realistic, possibly contaminated $Y^{\rm re}$ on an SO-neighborhood.

##### Minimax-SO problem

Minimize the maximal MSE on an SO-neighborhood, i.e., find a measurable reconstruction $f(Y^{\rm re})$ for $Y^{\rm id}$ s.t.

$$\max_{U^{\rm SO}}E_{\rm re}|Y^{\rm id}-f(Y^{\rm re})|^2=\min_f! \tag{5.1}$$

The solution is given by

###### Theorem 5.1 (Minimax-SO)

In this situation, there is a saddle-point $\big(f_0,P^{Y^{\rm di}}_0\big)$ for Problem (5.1)

$$f_0(y):=EY^{\rm id}+H_\rho(D(y)),\qquad H_b(z)=z\min\{1,b/|z|\} \tag{5.2}$$
$$P^{Y^{\rm di}}_0(dy):=\frac{1-r}{r}\big(|D(y)|/\rho-1\big)_+\,P^{Y^{\rm id}}(dy) \tag{5.3}$$

where $\rho>0$ ensures that $P^{Y^{\rm di}}_0$ is a probability measure, and

$$D(y)=y-EY^{\rm id} \tag{5.4}$$

The value of the minimax risk of Problem (5.1) is

$$\mathrm{tr}\,\mathrm{Cov}(Y^{\rm id})-(1-r)\,E_{\rm id}\big[\min\{|D(Y^{\rm id})|,\rho\}^2\big] \tag{5.5}$$
###### Proof.

See the Appendix. ∎

The optimal procedure in equation (5.2) has an appealing interpretation: it is a compromise between the (unobservable) situations that (a) one observes the ideal $Y^{\rm id}$, in which case one would use it unchanged, and (b) one observes $Y^{\rm di}$, i.e., something completely unrelated to $Y^{\rm id}$, in which case one would use the best prediction for $Y^{\rm id}$ (in the MSE-sense) without any information, hence the unconditional expectation $EY^{\rm id}$. The decision on how much to tend to case (a) and how much to case (b) is taken according to the (length of the) discrepancy $D(y)$ between the observed signal $y$ and $EY^{\rm id}$. If this length is smaller than $\rho$, we keep $y$ unchanged; otherwise we modify $EY^{\rm id}$ by adding a clipped version of $D(y)$.
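The reconstruction (5.2) is a one-line Huber-type shrinkage; here is a one-dimensional sketch (hypothetical Python, with illustrative numbers for $EY^{\rm id}$ and $\rho$):

```python
import numpy as np

def H(z, b):
    """Clipping function H_b(z) = z * min(1, b/|z|) from (5.2)."""
    nz = np.abs(z)
    return np.where(nz <= b, z, b * np.sign(z))

def f0(y, mean_id, rho):
    """Minimax-SO reconstruction f_0(y) = E Y^id + H_rho(y - E Y^id):
    keep y if it lies within rho of the ideal mean, otherwise pull it
    back to the boundary of the ball of radius rho around the mean."""
    return mean_id + H(y - mean_id, rho)
```

Observations inside the ball pass through unchanged; far-out observations are mapped to the ball's surface, capping their influence.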

#### 5.4.2 Robustification of Steps (RN), (E)

In the Girsanov/change-of-measure step we recall that the corresponding likelihood ratio here is just

$$\lambda_s:=\frac{\sigma^{-1}(x_{s-1})\,\phi\big(\sigma^{-1}(x_{s-1})(y_s-f(x_{s-1}))\big)}{\phi(y_s)} \tag{5.6}$$

Apparently $\lambda_s$ can take values close to $0$ and, more dangerously, in particular for small values of $\sigma$, very large values. So bounding the $\lambda_s$ is crucial to avoid effects like in Figure 3.

A first (non-robust) remedy uses a data-driven reference measure, i.e., instead of $\phi(y_s)$, we use $\bar\sigma^{-1}\phi(\bar\sigma^{-1}y_s)$, where $\bar\sigma$ is a global scale measure taken over all observations, ignoring time dependence and state-varying $\sigma$’s. A robust proposal would take $\bar\sigma$ to be the MAD of all observations (tuned for consistency at the normal distribution). This leads to

$$\tilde\lambda_s := \frac{\sigma^{-1}(x_{s-1})\,\varphi\big(\sigma^{-1}(x_{s-1})\,(y_s-f(x_{s-1}))\big)}{\bar\sigma^{-1}\varphi(\bar\sigma^{-1}y_s)} \qquad (5.7)$$

Eventually, in both estimation and filtering/prediction, $\bar\sigma^{-1}\varphi(\bar\sigma^{-1}y_s)$ cancels out as a common factor in numerator and denominator, so it is irrelevant in the subsequent steps; its mere purpose is to stabilize the terms numerically.

To take into account time dynamics in our robustification, we want to use Theorem 5.1, but to this end, we need second moments, which for $\tilde\lambda_s$ need not exist. So instead, we apply the theorem to $\sqrt{\tilde\lambda_s}$, which means that $\tilde\lambda_s$ is robustified by

$$\bar\lambda_s=\Big(\mathrm{E}_{id}\sqrt{\tilde\lambda_s}+H_b\big(\sqrt{\tilde\lambda_s}-\mathrm{E}_{id}\sqrt{\tilde\lambda_s}\big)\Big)^2 \qquad (5.8)$$

for $H_b$ from (5.2). The clipping height $b$ in turn is chosen such that $\bar\lambda_s$ differs from $\tilde\lambda_s$ only on a small fraction of the (ideal) observations, say $5\%$, for instance. As in the ideal situation $\mathrm{E}\,\lambda_s=1$, in a last step, with a consistency factor $c$ determined similarly to the one in the initialization step for the weighted MADs, we pass over to $c\,\bar\lambda_s$ such that $\mathrm{E}_{id}[c\,\bar\lambda_s]=1$.
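A minimal numerical sketch of (5.7) and (5.8) for scalar observations (an illustration only: the ideal expectation $\mathrm{E}_{id}\sqrt{\tilde\lambda_s}$ and the consistency factor are replaced here by batch means, which is a simplification not prescribed by the text):

```python
import numpy as np

def phi(x):
    """Standard normal density."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def lambda_tilde(y, mu, sigma, sigma_bar):
    """Stabilized likelihood ratio (5.7) with data-driven reference scale sigma_bar."""
    return (phi((y - mu) / sigma) / sigma) / (phi(y / sigma_bar) / sigma_bar)

def lambda_bar(lam, b):
    """Robustification (5.8): clip sqrt(lambda) around its ideal expectation
    (approximated by the batch mean), square, then rescale so the mean is 1."""
    s = np.sqrt(lam)
    m = s.mean()                           # stand-in for E_id sqrt(lambda~)
    lb = (m + np.clip(s - m, -b, b))**2    # scalar version of H_b
    return lb / lb.mean()                  # stand-in for the consistency factor c
```

Note how a single extreme likelihood ratio can no longer dominate the batch: its square-root deviation from the center is capped at $b$ before squaring.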

Similarly, in the remaining parts of the E-step, for each of the filtered processes, with the process generically denoted by $G$ and its filtered version by $\hat G$, we could replace $\hat G$ by

$$\bar G=\mathrm{E}_{id}\,\hat G+H_b\big(\hat G-\mathrm{E}_{id}\,\hat G\big) \qquad (5.9)$$

for any of the filtered processes $\mathcal{J}$, $\mathcal{O}$, and $\mathcal{T}$, and again suitably chosen $b$.

It turns out, though, that it is preferable to pursue another route. The aggregates are used in the M-step in (3.10), but for a robustification of this step it is crucial to be able to attribute individual influence to each of the observations; so instead we split up the terms of the filtered neg-loglikelihood into summands, one per observation, and hence we may skip a robustification of $\mathcal{T}$. Similarly, as $\mathcal{J}$ and $\mathcal{O}$ are filtered observations of multinomial-like variables, a robustification is of limited use, as any contribution of a single observation to these variables can at most be of absolute value $1$, so it is bounded anyway. Hence in the E-step, we only robustify the $\lambda_s$. Splitting up the aggregates into summands amounts to giving up strict recursivity: for the observations in one batch, one now has to store the individual summands for $s=1,\dots,t$, and at observation time $t$ within the batch we construct the time-$t$ summands from the time-$(t-1)$ ones, so we have a growing triangle of weight values. This would lead to increasing memory requirements, had we not chosen to work in batches of fixed length, which puts an upper bound on memory needs.
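The bookkeeping just described might be sketched as follows (a hypothetical skeleton only; `update` stands for whatever recursion propagates a per-observation summand from time $t-1$ to time $t$):

```python
def batch_summands(update, batch):
    """Non-recursive E-step bookkeeping within one batch of fixed length:
    at each time t, recompute the per-observation summands g[s], s <= t,
    from their time-(t-1) values, then open a fresh summand for s = t.
    Storage is bounded by the batch length."""
    g = []                       # summands for s = 1..t
    for t, y in enumerate(batch, start=1):
        g = [update(gs, s, t, y) for s, gs in enumerate(g, start=1)]
        g.append(0.0)            # initialize the new summand for s = t
    return g
```

Across the batch this traverses exactly the growing triangle of values mentioned above, while the fixed batch length caps the memory footprint.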

### 5.5 Robustification of the (M)-Step

As mentioned before, in contrast to the (E)-step, in this estimation step we may work with classical gross error neighborhoods (4.1) and with the standard i.i.d. setting.

#### 5.5.1 Shrinking Neighborhood Approach

By Bienaymé, the variance of an estimator then usually is of order $1/n$ for sample size $n$, while for robust estimators, the maximal bias is proportional to the neighborhood radius $r$. Hence, unless $r$ is appropriately scaled in $n$, bias will eventually dominate for growing $n$. This is avoided in the shrinking neighborhood approach by letting the radius shrink in $n$; compare Rieder (1994). Kohl et al. (2010) set $r=r_n=r_0/\sqrt{n}$ for some initial radius $r_0$. One could see this shrinking as indicating that with growing $n$, diligence is increasing, so the rate of outliers is decreasing. This is perhaps overly optimistic. Another interpretation is that the severeness of the robustness problem with, say, $10$ outliers at sample size $100$ should not be compared with the one with $1000$ outliers at sample size $10000$, but rather with the one with $100$ outliers at this larger sample size.
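Numerically, the $r_n=r_0/\sqrt{n}$ scaling looks as follows (a trivial sketch; `r0` denotes the initial radius):

```python
import math

def shrink_radius(r0, n):
    """Shrinking-neighborhood radius r_n = r0 / sqrt(n)."""
    return r0 / math.sqrt(n)

# The expected number of outliers n * r_n = r0 * sqrt(n) grows like sqrt(n):
# with r0 = 1, that is 10 outliers at n = 100, but 100 (not 1000) at n = 10000.
```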

In this shrinking neighborhood setting, with mathematical rigor, optimization of the robust criteria can be deferred to the respective IFs, i.e. instead of determining the IF of a given procedure, we construct a procedure to a given (optimally-robust) IF.

This is achieved by the concept of asymptotically linear estimators (ALEs), as it arises canonically in most proofs of asymptotic normality: In a smooth ($L_2$-differentiable) parametric model $\{P_\theta,\ \theta\in\Theta\}$ for i.i.d. observations with open parameter domain $\Theta\subset\mathbb{R}^p$, based on the scores $\Lambda_\theta$ (usually $\Lambda_\theta$ is the logarithmic derivative of the density w.r.t. the parameter, i.e., $\Lambda_\theta=\partial_\theta\log p_\theta$) and its finite Fisher information $\mathcal{I}_\theta=\mathrm{E}_\theta\,\Lambda_\theta\Lambda_\theta^\tau$, we define the set $\Psi(\theta)$ of influence functions as the subset of $L_2^p(P_\theta)$ consisting of square integrable functions $\psi_\theta$ with $p$ coordinates such that $\mathrm{E}_\theta\,\psi_\theta=0$ and $\mathrm{E}_\theta\,\psi_\theta\Lambda_\theta^\tau=\mathrm{I}_p$, where $\mathrm{I}_p$ is the $p$-dimensional unit matrix. Then a sequence of estimators $S_n$ is called an ALE if

$$S_n=\theta+\frac{1}{n}\sum_{i=1}^n\psi_\theta(X_i)+o_{P_\theta^n}(n^{-1/2}) \qquad (5.10)$$

for some influence function $\psi_\theta\in\Psi(\theta)$. In the sequel we fix the true $\theta$ and suppress it from notation where clear from context.

In particular, the MLE usually has influence function $\psi_\theta=\mathcal{I}_\theta^{-1}\Lambda_\theta$, while most other common estimators also have a representation (5.10) with a different $\psi_\theta$.

For a given IF $\psi_\theta$, we may construct an ALE with $\psi_\theta$ as IF by a one-step construction, often called one-step-reweighting: To a given starting estimator $\theta_n^0$ such that $\sqrt{n}\,(\theta_n^0-\theta)$ is tight, we define

$$\hat\theta_n=\theta_n^0+\frac{1}{n}\sum_{j=1}^n\psi_{\theta_n^0}(X_j) \qquad (5.11)$$

Then indeed $\hat\theta_n$ is an ALE with influence function $\psi_\theta$, i.e., $\hat\theta_n$ forgets about $\theta_n^0$ as to its asymptotic variance and GES; however, its breakdown point is inherited from $\theta_n^0$ once $\Theta$ is unbounded and $\psi_\theta$ is bounded. Hence for the starting estimator, we seek one with a high breakdown point.

For a more detailed account on this approach, see Rieder (1994).
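As an illustration of the one-step construction (5.11) in a simple location model (a sketch only: we use the median/MAD as high-breakdown start and a Huber-type bounded $\psi$, standardized empirically; none of these choices is specific to the HMM setting):

```python
import numpy as np

def one_step_location(x, k=1.345):
    """One-step construction (5.11) for a location parameter:
    median as high-breakdown starting estimator, then one averaged
    bounded-IF correction of Huber type."""
    theta0 = np.median(x)                        # robust starting estimator
    s = 1.4826 * np.median(np.abs(x - theta0))   # MAD, consistent at the normal
    r = (x - theta0) / s                         # standardized residuals
    psi = np.clip(r, -k, k)                      # bounded psi
    c = np.mean(np.abs(r) <= k)                  # empirical standardization of psi
    return theta0 + s * psi.mean() / c
```

The correction term has bounded influence by construction, while the breakdown point is inherited from the median/MAD start, as described above.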

#### 5.5.2 Shrinking Neighborhood Approach With Weighted Observations

As mentioned, coming from the E-step, not all observations are equally likely to contribute to a given state; hence we are in a situation with weighted observations, where we may pass over to normed weights $w_j^0$ summing up to $1$.

Suppressing the state index from notation here, with these weights $w_j^0$ and with $\theta_n^0$ the vector of the weighted median and the scaled weighted MAD for the respective state, (5.11) becomes

$$\hat\theta_n=\theta_n^0+\sum_{j=1}^n w_j^0\,\psi_{\theta_n^0}(y_j) \qquad (5.12)$$

for