The Viterbi process, decay-convexity and parallelized maximum a-posteriori estimation

The Viterbi process, decay-convexity and parallelized maximum a-posteriori estimation

Abstract

The Viterbi process is the limiting maximum a-posteriori estimate of the unobserved path in a hidden Markov model as the length of the time horizon grows. The existence of such a process suggests that approximate estimation using optimization algorithms which process data segments in parallel may be accurate. For models on state-space satisfying a new “decay-convexity” condition, we develop an approach to existence of the Viterbi process via fixed points of ordinary differential equations in a certain infinite dimensional Hilbert space. Bounds on the distance to the Viterbi process show that approximate estimation via parallelization can indeed be accurate and scaleable to high-dimensional problems because the rate of convergence to the Viterbi process does not necessarily depend on . The results are applied to a factor model with stochastic volatility and a model of neural population activity.

C
\jname

Biometrika

onvex Optimization, Gradient Flow, State-space Models.

1 Introduction

1.1 Background and motivation

Consider a process where is a Markov chain with state space whose initial distribution and transition kernel admit densities and with respect to Lebesgue measure, and , each valued in a measurable space , are conditionally independent given and such that for any , the conditional probability of given can be written in the form , where and is a measure on . Models of this form, under names of hidden Markov or state-space models, are applied in a wide variety of fields including econometrics, engineering, ecology, machine learning and neuroscience (Durbin & Koopman, 2012; Douc et al., 2009; West & Harrison, 2006).

Throughout this paper and unless specified otherwise, a distinguished data sequence is considered fixed and suppressed from much of the notation. Define:

 Un(x0,…,xn) =−logμ(x0)−logg(x0,y⋆0) −n∑m=1logf(xm−1,xm)−n∑m=1logg(xm,y⋆m). (1)

The posterior density at the path given , say , is then proportional to . The maximum a-posteriori path estimation problem given is to find:

 (ξn0,…,ξnn)=argminx0,…,xnUn(x0,…,xn). (2)

In addition to serving as a point estimate of the hidden trajectory, the solution of (2) is of interest when calculating the Bayesian information criterion (Schwarz, 1978) with non-uniform priors over the hidden trajectory, can be used in initialization of Markov chain Monte Carlo algorithms to sample from , and for log-concave posterior densities is automatically accompanied by universal bounds on highest posterior density credible regions thanks to concentration of measure inequalities (Pereyra, 2017; Bobkov & Madiman, 2011).

The Viterbi process is a sequence such that for any ,

 (ξ∞0,…,ξ∞n)=limm→∞(ξm0,…,ξmn). (3)

Its existence was first studied in the information theory literature, (Caliebe & Rosler, 2002; Caliebe, 2006), for models in which the state-space of is a set of a finite number of states and the convergence in (3) is with respect to the discrete metric. The “Viterbi process” name appeared later, in (Lember & Koloydenko, 2010), inspired by the famous Viterbi decoding algorithm (Viterbi, 1967). We focus on the case of state-space . The only other work known to the authors which considers the Viterbi process with state-space is (Chigansky & Ritov, 2011), where convergence in (3) is with respect to Euclidean distance.

In these studies the Viterbi process appears to be primarily of theoretical interest. Here we consider practical motivation in a similar spirit to distributed optimization methods, e.g., (Moallemi & Van Roy, 2010; Rebeschini & Tatikonda, 2016a, b): the existence of the limit in (3) suggests that (2) can be solved approximately using a collection of optimization algorithms which process data segments in parallel. To sketch the parallelization idea, with and assumed to be integers, consider the partition:

 (0,…,n)=ℓ⋃k=1Ak,Ak={(k−1)Δ,…,kΔ−1},

and for an integer consider the -enlargement of each ,

 Ak(δ)={m∈(0,…,n):∃a∈Ak:|m−a|≤δ}. (4)

Suppose the optimization problems:

 argmaxxAk(δ)pr(xAk(δ)|y⋆Ak(δ)),k=1,…,ℓ, (5)

where , , are solved in parallel. Then in a post-processing step, for each , the components indexed by of the solution to are discarded, and what remains concatenated across to give an approximation to the solution of (2).

If it takes time to solve (2) the speed-up from parallelization could be as much as a factor of . The main problem addressed in this paper is to study the rate of convergence to the Viterbi process in (3), and as a corollary we shall quantify the approximation error which trades off against the speed-up from parallelization as a function of , , the ingredients of the statistical model and properties of the observation sequence.

1.2 Relation to existing works

We approach the solutions of (2) indexed by and there tendency to the Viterbi process in an infinite dimensional Hilbert space, , where is a parameter related to the rate of convergence to the Viterbi process. This approach is new and has two benefits. Firstly, it allows interpretable quantitative bounds to be obtained which measure the distance to the Viterbi process in a norm on which gives a stronger notion of convergence than the pointwise convergence in (3). Secondly, via a new “decay-convexity” property of which may be of independent interest, our approach provides a characterization of the Viterbi process as the fixed point of an infinite dimensional ordinary differential equation which arises in the limit .

In totality, the collection of assumptions we make is neither stronger nor weaker than the collection of assumptions of (Chigansky & Ritov, 2011, Thm 3.1). Comparisons between some individual assumptions are discussed in section A.5. One commonality is that both our assumptions (see the decay-convexity condition in Theorem 1 combined with Lemma 1) and the assumptions of (Chigansky & Ritov, 2011, Thm 3.1) imply that is strongly log-concave, in the sense of (Saumard & Wellner, 2014).

From a statistical modelling perspective, this strong log-concavity might seem restrictive. However, the merit of assuming strong log-concavity must also take into account its attractive mathematical and computational consequences: strong-convexity of objective functions and strong-log concavity of target probability densities endows gradient-descent algorithms and certain families of diffusion Markov chain Monte Carlo algorithms with dimension-free convergence rates (Bubeck, 2015; Dalalyan, 2017) and plays a role in dimension-free contraction rates for the filtering equations of hidden Markov models (Whiteley, 2017). The notion of decay-convexity introduced here extends these dimension-free phenomena, in particular under our assumptions we shall illustrate that the parameter controlling the rate of convergence to the Viterbi process does not necessarily depend on .

The proof techniques of (Chigansky & Ritov, 2011, Thm 3.1) are quite different to ours. There the existence of the limit (3) is established using a converging series argument to bound terms in a dynamic programming recursion. A quantitative bound on the Euclidean distance between and is given in (Chigansky & Ritov, 2011, eqs. (3.13) and (3.15)); we address a stronger notion of convergence on the Hilbert space . The proof of (Chigansky & Ritov, 2011, Thm 3.1) is given only in the case , but the same approach may be applicable more generally.

Earlier works concerning discrete-state hidden Markov models (Caliebe & Rosler, 2002; Lember & Koloydenko, 2008) establish the existence of the limit in (3) by identifying stopping times which divide the optimal path into unrelated segments. (Chigansky & Ritov, 2011, Example 2.1) illustrates that this approach to existence of the limit can also be made to work when the state-space is , but it seems not to yield quantitative bounds easily.

In the broader literature on convex optimization, a theory of sensitivity of optimal points with respect to constraints in convex network optimization problems has been introduced by (Rebeschini & Tatikonda, 2016a, b). The notions of scale-free optimization developed there are similar in spirit to the objectives of the present paper, but the results are not directly comparable to ours since they concern a constrained optimization problem. In the context of unconstrained convex optimization problems with separable objective functions which allow for the structure of (2), (Moallemi & Van Roy, 2010) addressed the convergence of a min-sum message passing algorithm. Again some aspects of their analysis are similar in spirit to ours, but their aims and results are quite different.

Amongst our main assumptions will be continuous differentiability of the terms on the right of (1). Considering (3) as a regularized maximum likelihood problem, where the regularization comes from and , it would be particularly interesting to relax the differentiability assumption in order to accommodate sparsity inducing Lasso-type regularizers (Tibshirani, 1996), but this is beyond the scope of the present paper.

Whilst we restrict our attention to the objective functions in (1) associated with hidden Markov models where represents time, the techniques we develop could easily be generalized to objective functions which are additive functionals across tuples (rather than just pairs) of variables, and to situations where the arguments of the objective function are indexed over some set with a spatio-temporal (rather than just temporal) interpretation. Indeed many of the techniques presented here are not specific to hidden Markov models at all and may be of wider interest.

2 Main results

2.1 Definitions and assumptions

With considered fixed, we shall associate with a generic vector the vectors , each in , such that . With and the usual Euclidean inner product and norm on , define the inner product and norm on associated with a given ,

Let be the Hilbert space consisting of the set equipped with the inner-product and the usual element-wise addition and scalar multiplication of vectors over field . For each , denotes the subspace consisting of those such that for , with the convention that . Note that for the set of vectors does not actually depend on ; the notation is used to emphasize that is a subspace of .

For define

 ϕn(x) =logf(xn−1,xn)+logf(xn,xn+1)+logg(xn,y⋆n),n≥1, (6) ~ϕn(x) ={logμ(x0)+logf(x0,x1)+logg(x0,y⋆0),n=0,logf(xn−1,xn)+logg(xn,y⋆n),n≥1. (7)

and let and be the vectors in whose th entries are the partial derivatives of and with respect to the th entry of (the existence of such derivatives is part of Condition 2.1 below).

For each , define the vector field ,

 ∇Un(x)=−{∇0~ϕ0(x)T∇1ϕ1(x)T⋯∇n−1ϕn−1(x)T∇n~ϕn(x)T00⋯}T. (8)

With these definitions, the first elements of the vector are the partial derivatives of with respect to the elements of , whilst the other elements of the vector are zero. This zero-padding of to make an infinitely long vector is a mathematical convenience which will allow us to treat as a sequence of vector fields on .

Define also

 αγ,n=n∑m=0γn−mβm, βm=∥∇mϕm(0)∥2∨∥∇m~ϕm(0)∥2, (9)
 ηn(r)=sup∥x∥2γ,n≤r∥∇nϕn(x)∥2∨∥∇n~ϕn(x)∥2,∥x∥γ,n=(∞∑m=0γ|m−n|∥xm∥2)1/2. (10)
{condition}

a) , , and , , are everywhere strictly positive and continuously differentiable.

b) there exist constants such that , and for all

 ⟨xn−x′n,∇nϕn(x)−∇nϕn(x′)⟩

and

 ⟨xn−x′n,∇n~ϕn(x)−∇n~ϕn(x′)⟩ ≤{−~ζ∥x0−x′0∥2+θ∥x0−x′0∥∥x1−x′1∥,n=0,−~ζ∥xn−x′n∥2+θ∥xn−x′n∥∥x′n−1−x′n−1∥,for all n≥1.

2.2 Viterbi process as the limit of a Cauchy sequence in l2(γ)

Theorem 1

Assume that Condition 2.1 holds, and with as therein, let be any value in such that:

 ζ>θ(1+γ)22γ,~ζ>θ(1+γ)2γ. (11)

Then with any such that:

 0<λ≤{ζ−θ(1+γ)22γ}∧{~ζ−θ(1+γ)2γ}, (12)

and any ,

 (13)

Amongst all the vectors in , there is a unique vector such that , and

 supm∈N0,m≥n∥ξn−ξm∥2γ≤γnλ2ηn(λ−2αγ,n)+γn+1λ2ηn+1(γλ−2αγ,n)+1λ2∞∑k=n+2γkβk. (14)

The proof of Theorem 1 is in section A.3.

Remark 1

Since , the first elements of the vector solve the estimation problem (A.1), and since , the remaining elements of are zero.

Remark 2

When Condition 2.1 holds, there always exists satisfying (11) and satisfying (12) because and Condition 2.1 requires . The case is of interest because if the right hand side of (14) converges to zero as , then is a Cauchy sequence in , yielding the existence of the Viterbi process, as per the following corollary.

Corollary 1

If in addition to the assumptions of Theorem 1, and , then there exists in such that .

The assumptions of Corollary 1 on , and implicitly involve the observation sequence . An explicit discussion of the impact of is given in section 3.

2.3 Interpretation of the decay-convexity condition

From hereon (13) will be referred to as “decay-convexity” of . To explain the “convexity” part of this term, note that when , (13) says exactly that is -strongly log-concave, in the sense of (Saumard & Wellner, 2014).

To explain the “decay” part of decay-convexity, let us now address the case . It is well known that strong convexity of a continuously differentiable function is closely connected to exponential contraction properties of the associated gradient-flow differential equation. This connection underlies convergence analysis of gradient-descent algorithms, see for example (Nesterov, 2013, chapter 2). The inequality (13) can be interpreted similarly for any : using standard arguments for finite-dimensional ODE’s (a more general Hilbert space setting is fully treated in Proposition 1 in section A.2), it can be shown that when Condition 2.1a) and (13) hold, there is a unique, globally-defined flow which solves:

 ddtΦnt(x)=−∇Un{Φnt(x)},Φn0=Id. (15)

Here is a vector in , and the derivative with respect to time is element-wise. Noting the zero-padding of in (8), the first elements of together constitute the gradient flow associated with , whilst each of the remaining elements is the identity mapping on . Thus for , can be written as a sum of finitely many terms and by simple differentiation,

 ddt∥Φnt(x)−Φnt(x′)∥2γ=−2⟨Φnt(x)−Φnt(x′),∇Un{Φnt(x)}−∇Un{Φnt(x′)}⟩γ,

for all . Since implies , it follows from (13) that

 ∥Φnt(x)−Φnt(x′)∥γ≤e−λt∥x−x′∥γ,for all x,x′∈ln2(γ). (16)

To see the significance of the case , suppose that the initial conditions are such that for all . Then writing for the first elements of the vector , it follows from (16) that

 ∥Φnt,0(x)−Φnt,0(x′)∥≤e−λtγn/2∥xn−x′n∥.

Thus when , (13) ensures that as , the influence of on decays as with rate given by .

Turning to the inequalities in (11), observe that if is fixed, these inequalities are satisfied if is large enough. Further observe that if Condition 2.1b) is satisfied in the extreme case , then each (respectively ) is (respectively )-strongly concave in and then it is immediate that (13) holds. Discussion of how condition 2.1b) relates to the model ingredients , , is given in section 3.

The following lemma addresses the relationship between the cases and , hence explaining the conjunction of “decay” and “convexity” in the name we give to (13).

Lemma 1

If is twice continuously differentiable and (13) holds for some and , then it also holds with that same and .

The proof is given in section A.3. Lemma 1 can perhaps be generalized from twice to once continuous differentiability by function approximation arguments, but this is a rather technical matter which it is not our priority to pursue.

2.4 Viterbi process as the fixed point of a differential equation on l2(γ)

Define the vector field ,

 ∇U∞(x)=−{∇0~ϕ0(x)T∇1ϕ1(x)T∇2ϕ2(x)T⋯}T. (17)

An important point here about notation and interpretation: element-wise, the vector is the limit as of the vector . Indeed it can be read off from (8) that each element of the vector is constant in for all large enough. However, may not be interpreted as the gradient of the limit of the sequence of functions , because the pointwise limit is in general not well-defined. This reflects the fact that on an infinite time horizon, the prior and posterior probability measures over the entire state sequence are typically singular, so that a density “” does not exist. Hence there is no sense in characterizing the Viterbi process as: “”, a correct characterization is: , as Theorem 2 shows via a counter-part of (15)-(16) in the case .

Theorem 2

In addition to the assumptions of Theorem 1 and with as therein, assume a)-c):

a) there exists a finite constant such that for all and ,

 ∥∇nϕn(x)∥2∨∥∇n~ϕn(x)∥2≤βn+χ(∥xn−1∥2+∥xn∥2+∥xn+1∥2),

b) ,

c) is continuous in .

Then with as in Theorem 1,

 ⟨x−x′,∇U∞(x)−∇U∞(x′)⟩γ≥λ∥x−x′∥2γ,for% all x,x′∈l2(γ), (18)

and there exists a unique and globally defined flow which solves the Fréchet ordinary differential equation,

 ddtΦ∞t(x)=−∇U∞{Φ∞t(x)},Φ∞0=Id. (19)

This flow has a unique fixed point, , and for all and . Furthermore with as in Theorem 1,

 supm∈N0∪{∞},m≥n∥ξn−ξm∥2γ≤1λ2(γn−1αγ,n2χλ2+∞∑k=nγkβk). (20)

The proof of Theorem 2 is in section A.3.

The assumptions a)-b) in Theorem 2 ensure that maps to itself. Combined with the continuity in assumption c) and (18), this allows an existence and uniqueness result of (Deimling, 2006) for dissipative ordinary differential equations on Banach spaces to be applied in the proof of Theorem 2. It is from here that the Fréchet derivative (19) arises. Background information about Fréchet derivatives is given in section A.1.

3 Discussion

3.1 Bound on the segment-wise error in the parallelized scheme

The error associated with the first segment in the parallelization scheme described in section 1.1 can be bounded using (14) or (20). In this section we focus on the latter for simplicity of presentation, an application of (14) is discussed in section 3.3. The following is an immediate corollary of (20).

Corollary 2

If the assumptions of Theorem 2 hold,

 supn≥Δ+δΔ∑m=0∥ξΔ+δm−ξnm∥2 ≤γ−Δsupn≥Δ+δ∥ξΔ+δ−ξn∥2γ ≤1λ2(γδ−12χλ2αγ,Δ+δ+∞∑k=δγkβΔ+k). (21)

Recall from (4) that is the “overlap” between segments in the parallelization scheme. To see the impact of the observations , recall from (6), (7) and (9) that depends on only through where is gradient with respect to . Therefore whether or not the right hand side of (21) converges to zero as depends on the behaviour of as . There are a wide variety of assumptions on which would suffice for convergence to zero. In the particular case of stationary observations, the rate or convergence is exponential:

Corollary 3

Let the assumptions of Theorem 2 hold and let be as therein. Let be any -valued, stationary stochastic process such that . In particular, need not be distributed according the hidden Markov model described in section 1.1. Then for any , there exists a stationary process such that almost surely, and such that if in (1)-(2) the sequence is replaced by the random variables , then for any ,

 supn≥Δ+δΔ∑m=0∥ξΔ+δm−ξnm∥2≤~γδCΔ,almostsurely.

The proof is given in section A.3 and the interested reader can deduce an explicit expression for from the details there.

Bounds on the errors associated with the other segments in the parallelization scheme can be obtained by very similar arguments to those in proof of Theorem 2 but presenting all the details would involve substantial repetition.

3.2 Verifying Condition 2.1 and dimension independence of γ and λ

The following lemma provides an example of a model class satisfying Condition 2.1, allowing us to illustrate that the constants and appearing in Theorems 1 and 2 do not necessarily have any dependence on the dimension of the state-space, . Here the smallest and largest eigenvalues of a real, symmetric matrix, say , are denoted , .

Lemma 2

Assume a) and b):

a) The unobserved process satisfies

 Xn=AXn−1+b+Wn, (22)

where for , is independent of other random variables, , and are positive definite, is a matrix and and are length- vectors.

b) For each , is strictly positive, continuously differentiable and there exists such that for all and ,

 ⟨xn−x′n,∇nlogg(xn,y⋆n)−∇nlogg(x′n,y⋆n)⟩≤λg∥xn−x′n∥2. (23)

If the inequality is satisfied by:

 ζ =1+ρmin(ATA)ρmax(Σ)−λg, (24) ~ζ =1ρmax(Σ)∧{1ρmax(Σ0)+ρmin(ATA)ρmax(Σ)}−λg, (25) θ =ρmax(ATA)1/2ρmin(Σ), (26)

then Condition 2.1 holds.

The proof is in section A.4.

Remark 3

The condition (23) is called semi-log-concavity of , generalizing log-concavity by allowing , rather than only . Considering the case for ease of presentation, an example of a likelihood function which satisfies (23) for some , but not for any is the -centered Student’s t-density, for example in the case of degree of freedom: . The case also allows for multi-modality of .

Remark 4

The fact that , and in (24)-(26) depend only on eigenvalues of , and and the semi-concavity parameter means they, and consequently and , do not necessarily depend on dimension. As a simple example consider the case: , and , with and . In this situation holds, and and can be chosen to depend only on and and be such that (12) holds.

Remark 5

The condition can be interpreted as balancing the magnitude of temporal correlation in (22) against the fluctuations of and the degree to which the likelihood is informative about . As the mapping becomes more strongly log-concave, and by inspection of (24)-(26) the condition can always be achieved if takes a negative value large enough in magnitude, with other quantities on the right of the equations (24)-(26) held constant. On the other hand, if , which implies for any value of , the condition can be achieved if is small enough.

Remark 6

If the likelihood functions are sufficiently strongly log-concave, that is if (23) holds with a negative value of which is sufficiently large in magnitude, neither log-concavity of the distribution of nor linearity of the evolution equation (22) is necessary for Condition 2.1 to hold. An example illustrating this fact is given in section A.4.

3.3 Application to a factor model with stochastic volatility

Factor models are used in economics and finance to represent asset returns in terms of covariates such as market benchmarks or macroeconomic indicators, and idiosyncratic asset-specific fluctuations. Modelling of time-dependent conditional covariances within such models is an area of active research (De Nard et al., 2019; Engle et al., 2019). One way to instantiate dynamic conditional covariance is via a stochastic volatility model where, for example, the log-variances of the idiosyncratic terms are modelled in terms of a hidden process. See (Asai et al., 2006; Shephard & Andersen, 2009) for overview of stochastic volatility models. The task of maximum a-posteriori estimation of the latent log-variance process in such models was discussed by (Shephard, 1996).

We consider a model in which at time , the returns on assets are modelled in terms of factors ,

 Yn=BZn+V1/2nϵt,

where is a by matrix, is a vector of independent and identically distributed, zero-mean, unit-variance normal random variables, and is a by diagonal matrix whose th diagonal element is , and follows the vector autoregressive model specified in section 3.2.

In the case that the factors are observed with realized values , the likelihood function for is:

 g(xn,yn)=1(2π)d/2exp[−12d∑i=1xin−12d∑i=1{yin−(Bzn)i}2exp(−xin)]. (27)

For any , is log-concave, so (23) holds with . In this situation, as per remark 5, the condition in Lemma 2 and hence Condition 2.1 is satisfied for small enough. Hence Theorem 1 applies.

Due to the doubly-exponential dependence on in (27), assumption a) of Theorem 2 does not hold, so for the existence of the Viterbi process we appeal to the approach described in Corollary 1. For this purpose, let us examine the quantities on the right of inequality (14). Assuming for simplicity of presentation that ,

 βn=14d∑i=1[{(y⋆n)i−(Bzn)i}2−1]2.

Also assuming and so that a-priori the hidden process is stationary, it can be shown by elementary manipulations that the following rather crude upper bound holds:

 ηn(r)≤3rρmin(Σ)(1+1γρmax(ATA)ρmax(1+ATA)2)+14d∑i=1[{(y⋆n)i−(Bzn)i}2exp(√r)−1].

In a similar spirit to Corollary 3, let and be stationary stochastic processes such that . Then it can be shown that if and in (27) are replaced by and , for any there exists an almost surely finite random variable, say , such that:

 γn{ηn(λ−2αγ,n)∨ηn+1(γλ−2αγ,n)}≤C(γ~γ)nand∞∑n=0γnβn<∞,almostsurely. (28)

The proof of this claim is very similar to the proof of of Corollary