State-by-state Minimax Adaptive Estimation for Nonparametric Hidden Markov Models
This paper considers the problem of estimating the emission densities of a nonparametric finite state space hidden Markov model in a way that is state-by-state adaptive and leads to minimax rates for each emission density–as opposed to globally minimax estimators, which adapt to the worst regularity among the emission densities. We propose a model selection procedure based on the Goldenschluger-Lepski method. Our method is computationally efficient and only requires a family of preliminary estimators, without any restriction on the type of estimators considered. We present two such estimators that allow to reach minimax rates up to a logarithmic term: a spectral estimator and a least squares estimator. Finally, numerical experiments assess the performance of the method and illustrate how to calibrate it in practice. Our method is not specific to hidden Markov models and can be applied to nonparametric multiview mixture models.
Keywords: hidden Markov models; model selection; nonparametric density estimation; adaptive minimax estimation; spectral method; least squares method.
Finite state space hidden Markov models, or HMMs in short, are powerful tools for studying discrete time series and have been used in a variety of applications such as economics, signal processing and image analysis, genomics, ecology, speech recognition and ecology among others. The basic idea is similar to mixture models, where each observation depends only on the population it comes from. In a HMM, each observation depends only on a hidden variable, its hidden state, with markovian regime.
Formally, a hidden Markov model is a process where is a Markov chain on and the ’s are independent conditionally on with a distribution depending only on . The parameters of the HMM are the initial distribution and transition matrix of the Markov chain and the emission distributions, that is the distributions where is the distribution of conditionally to . The learning problem is to infer the parameters from the observations only.
In this article, we focus on estimating the emission distributions in a nonparametric setting. More specifically, assume that the emission distributions have a density with respect to some known dominating measure , and write their densities–which we call the emission densities. The goal of this paper is to estimate all ’s with the best possible convergence rate when the emission densities are not restricted to belong to a set of densities described by finitely many parameters.
Nonparametric state-by-state adaptivity.
Theoretical results in the nonparametric setting have only been developed recently. de Castro et al. (to appear) and Bonhomme et al. (2016b) introduce spectral methods, and the latter is proved to be minimax but not adaptive. de Castro et al. (2016) introduce a least squares estimator which is shown to be minimax adaptive up to a logarithmic term. However, all these papers have a common drawback: they study the emission densities as a whole and do not handle them separately. This translates into their error criterion, which is the supremum of the errors on all densities. Thus, what they actually prove is that converges with minimax rate when are their density estimators. Such results are not entirely satisfactory, since the regularity of each emission density could be different, leading to different convergence rates. In particular, this means that having just one emission density that is very hard to estimate is enough to deteriorate the convergence rate of all emission densities. Solving this problem requires a different approach.
This is the main point of our paper. We show that it is possible to reach state-by-state adaptive minimax convergence rate. To do so, we propose a new method which is able to estimate each emission density separately.
The method we propose is based on the Goldenschluger-Lepski method developed in the seminal papers by Goldenshluger and Lepski (2011, 2014) for density estimation, extended by Goldenshluger and Lepski (2013) to the white noise and regression models. It consists in using an auxiliary family of estimators and choosing the estimator that performs a good bias-variance tradeoff separately for each hidden state.
The method and assumptions are detailed in Section 2. Let us give a quick overview of the method. Given estimators of the emission densities for each hidden state and each model in a given family of models, one computes a substitute for the bias of the estimators by
for some penalty . Then, for each state , one selects the estimator from the model minimizing the quantity . The penalty can also be interpreted as a variance bound, so that this penalization procedure can be seen as performing a bias-variance tradeoff. The novelty of this method is that it selects a different , that is a different model, for each hidden state: this is where the state-by-state adaptivity comes from.
The main theoretical result is an oracle inequality on the selected estimators , see Theorem 2. As a consequence, we are able to get a convergence rate that is different for each state. These convergence rates will even be adaptive minimax up to a logarithmic factor when the method is applied to our two preliminary estimators: a spectral estimator and a least squares estimator. To the best of our knowledge, this is the first state-by-state adaptive algorithm for hidden Markov models.
Note that finding the right penalty term is essential in order to obtain minimax convergence rates. This requires a fine theoretical control of the variance of the auxiliary estimators, in the form of assumption [H] (see Section 2.1). To the best of our knowledge, there is no suitable result in the literature. This is the second theoretical contribution of this paper: we control two auxiliary estimators in a way that makes it possible to reach adaptive minimax rate with our state-by-state selection method, up to a logarithmic term.
On a practical side, we run this method and several variants on data simulated from a HMM with three hidden states and one irregular density, as illustrated in Section 4. The simulations confirm that it converges with a different rate for each emission density, and that the irregular density does not alter the convergence rate of the other ones, which is exactly what we wanted to achieve.
Better still, the added computation time is negligible compared to the computation time of the estimators: even for the spectral estimators of Section 3.2 (which can be computed much faster than the least squares estimators and the maximum likelihood estimators using EM), computing the estimators on 200 models for 50,000 observations (the lower bound of our sample sizes) of a 3-states HMM requires a few minutes, compared to a couple of seconds for the state-by-state selection step. The difference becomes even larger for more observations, since the complexity of the state-by-state selection step is independent of the sample size: for instance, computing the spectral estimators on 300 models for 2,200,000 observations requires a bit less than two hours, and a bit more than ten hours for 10,000,000 observations, compared to less than ten seconds for the selection step in both cases. We refer to Section 4.5 for a more detailled discussion about the algorithmic complexity of the algorithms.
We use two methods to construct auxiliary estimators and apply the selection estimator. The motivation and key result of this part of the paper is to control the variances of the estimators by the right penalty . This part is crucial if one wants to get adaptive minimax rates, and has not been adressed in previous papers. For both methods, we develop new theoretical results that allow to obtain a penalty that leads to adaptive minimax convergence rates up to a logarithmic term. We present the algorithms and theoretical guarantees in Section 3.
The first method is a spectral method and is detailed in Section 3.2. Several spectral algorithms were developed, see for instance Anandkumar et al. (2012) and Hsu et al. (2012) in the parametric setting, and Bonhomme et al. (2016b) and de Castro et al. (to appear) in a nonparametric framework. The main advantages of spectral methods are their computational efficiency and the fact that they do not resort to optimization procedure such as the EM and more generally nonconvex optimization algorithm, thus avoiding the well-documented issue of getting stuck into local sub-optimal minima.
Our spectral algorithm is based on the one studied in de Castro et al. (to appear). However, their estimator cannot reach the minimax convergence rate: the variance bound deduced from their results is proportional to , while reaching the minimax rate requires to be proportional to . To solve this issue, we introduce a modified version of their algorithm and show that it has the right variance bound, so that it is able to reach the adaptive minimax rate after our state-by-state selection procedure, up to a logarithmic term. Our algorithm also has an improved complexity: it is at most quasi-linear in the number of observations and in the model dimension, instead of cubic in the model dimension for the original algorithm.
The second method is a least squares method and is detailed in Section 3.3. Nonparametric least squares methods were first introduced by de Castro et al. (2016) to estimate the emission densities and extended by Lehéricy (2016) to estimate all parameters at once. They rely on estimating the density of three consecutive observations of the HMM using a least squares criterion. Since the model is identifiable from the distribution of three consecutive observations when the emission distributions are linearly independent, it is possible to recover the parameters from this density. In practice, these methods are more accurate than the spectral methods and are more stable when the models are close to not satisfying the identifiability condition, see for instance de Castro et al. (2016) for the accuracy and Lehéricy (2016) for the stability. However, since they rely on the minimization of a nonconvex criterion, the computation times of the corresponding algorithms are often longer than the ones from spectral methods.
A key step in proving theoretical guarantees for least squares methods is to relate the error on the density of three consecutive observations to the error on the HMM parameters in order to obtain an oracle inequality on the parameters from the oracle inequality on the density of three observations. More precisely, the difficult part is to lower bound the error on the density by the error on the parameters. Let us write and the densities of the first three observations of a HMM with parameters and respectively (these parameters actually correspond to the transition matrix and the emission densities of the HMM). Then one would like to get
where is the natural distance on the parameters and is a positive constant which does not depend on . Such inequalities are then used to lower bound the variance of the estimator of the density of three observations by the variance of the parameter estimators: let be the projection of and be the estimator of on the current approximation space (with index ). Denote and the corresponding parameters and assume that the error is bounded by some constant , then the result will be that
Such a result is crucial to control the variance of the estimators by a penalty term , which is the result we need for the state-by-state selection method. In the case where only the emission densities vary, de Castro et al. (2016) proved that such an inequality always holds for HMMs with 2 hidden states using brute-force computations, but it is still unknown whether it is always true for larger number of states. When the number of states is larger than 2, they show that this inequality holds under a generic assumption. Lehéricy (2016) extended this result to the case where all parameters may vary. However, the constants deduced from both articles are not explicit, and their regularity (when seen as a function of ) is unknown, which makes it impossible to use in our setting: one needs the constants to be lower bounded by the same positive constant, which requires some sort of regularity on the function in the neighborhood of the true parameters.
To solve this problem, we develop a finer control of the behaviour of the difference , which is summarized in Theorem 10. We show that it is possible to assume to be lower semicontinuous and positive without any additional assumption. In addition, we give an explicit formula for the constant when and are close, which gives an explicit bound for the asymptotical convergence rate. This result allows us to control the variance of the least squares estimators by a penalty which ensures that the state-by-state method reaches the adaptive minimax rate up to a logarithmic term.
Section 4 shows how to apply the state-by-state selection method in practice and shows its performance on simulated data.
Note that the theoretical results give a penalty term known only up to a multiplicative constant which is unknown in practice. This problem, the penalty calibration issue, is usual in model selection methods. It can be solved using algorithms such as the dimension jump heuristics, see for instance Birgé and Massart (2007), who introduce this heuristics and prove that it leads to an optimal penalization in the special case of Gaussian model selection framework. Nonetheless, this method has been shown to behave well in practice in a variety of domains, see for instance Baudry et al. (2012). We describe the method and show how to use this heuristics to calibrate the penalties in Section 4.2.
We propose and compare several variants of our algorithm. Section 4.2 shows some variants in the calibration the penalties and Section 4.3 shows other ways to select the final estimator. We discuss the result of the simulations and the convergence of the selected estimators in Section 4.4. Finally, Section 4.5 discusses the complexities of the auxiliary estimation methods and of the selection procedures.
Section 5 contains a conclusion and perspectives for this work.
Finally, Section 6 is dedicated to the proofs.
We will use the following notations throughout the paper.
is the set of integers between 1 and .
is the set of permutations of .
is the Frobenius norm.
is the linear space spanned by the family .
are the singular values of the matrix .
For , is the Gram matrix of , defined by for all .
2 The state-by-state selection procedure
In this section, we introduce the framework and our state-by-state selection method.
2.1 Framework and assumptions
Let be a Markov chain with finite state space of size . Let be its transition matrix and be its initial distribution. Let be random variables on a measured space with -finite such that conditionally on the ’s are independent with a distribution depending only on . Let be the distribution of conditionally to . Assume that has density with respect to . We call the emission distributions and the emission densities. Then is a hidden Markov model with parameters . The hidden chain is assumed to be unobserved, so that the estimators are based only on the observations .
Let be a nested family of finite-dimensional subspaces such that their union is dense in . For each index , we write the projection of on .
In order to estimate the emission densities, we will only use some of the . Let be the set of indices which will be used for the estimation from observations. For each , we assume we are given an estimator . We will need to assume that for all models, the distance between and is small with high probability. In the following, we will drop the dependency in and simply write and .
The following result is what one usually obtains in model selection. It bounds the distance between the estimators and the projections by some penalty function . Thus, can be seen as a bound of the variance term.
With probability ,
where the upper bound is nondecreasing in . We show in Sections 3.2 and 3.3 how to obtain such a result for a spectral method and for a least squares method (using an algorithm from Lehéricy (2016)). In the following, we omit the parameters and in the notations and only write .
What is important for the selection step is that the permutation does not depend on the model : one needs all estimators to correspond to the same emission density, namely when is the same for all models . This can be done in the following way: let and let
for all . Then, consider the estimators obtained by swapping the hidden states by these permutations. In other words, for all , consider
Now, assume that the error on the estimators is small enough. More precisely, write the distance between the projections of on the models and and assume that (that is twice the upper bound of the distance between two estimated emission densities corresponding to the same hidden states in models and ) is smaller than , which is the smallest distance between two different densities of the vector .
Then [H] ensures that with probability as least , for all , there exists a single component of that is closer than of , and this component will be by definition. This is summarized in the following lemma.
Assume [H] holds. Then with probability , there exists a permutation such that for all and for all such that
Proof in Section 6.1
Thus, this property holds asymptotically as soon as tends to infinity and tends to zero.
2.2 Estimator and oracle inequality
For each and , let
serves as a replacement for the bias of the estimator , as can be seen in Equation (2). This comes from the fact that for large , the quantity is upper bounded by the variances and (which are bounded by ) plus the bias . Thus, only the bias term remains after substracting the variance bound .
Then, for all , select a model through the bias-variance tradeoff
and finally take
The following theorem shows an oracle inequality on this estimator.
Let and assume equation (1) holds for all with probability . Then with probability ,
Proof We restrict ourselves to the event of probability at least where equation (1) holds for all .
The first step consists in decomposing the total error: for all and ,
From now on, we will omit the subscripts and . Using equation (1) and the definition of and , one gets
Then, notice that can be bounded by
Since is nondecreasing, , so that the first term is upper bounded by zero thanks to equation (1). The second term can be controlled since the orthogonal projection is a contraction. This leads to
which is enough to conclude.
The oracle inequality also holds when taking
This positivity condition appears in the original Goldenschluger-Lepski method.
Note that the selected implicitely depends on the probability of error through the penalty .
In the asymptotic setting, we take as a function of , so that is a function of only. This will be used to get rid of when proving that the estimators reach the minimax convergence rates.
3 Plug-in estimators and theoretical guarantees
In this section, we introduce two methods to construct preliminary estimators of the emission densities. We show that they satisfy assumption [H] for a given variance bound .
3.1 Framework and assumptions
Recall that we approximate by a nested family of finite-dimensional subspaces such that their union is dense in and write the orthogonal projection of on for all and . We assume that and that the space has dimension . A typical way to construct such spaces is to take spanned by the first vectors of an orthonormal basis.
Both methods will construct an estimator of the emission densities for each model of this family. These estimators will then be plugged in the state-by-state selection method of Section 2.2, which will select one model for each state of the HMM.
We will need the following assumptions.
is a stationary ergodic Markov chain with parameters ;
is invertible and the family is linearly independent.
The ergodicity assumption in [HX] is standard in order to obtain convergence results. In this case, the initial distribution is forgotten exponentially fast, so that the HMM will essentially behave like a stationary process after a short period of time. For the sake of simplicity, we assume the Markov chain to be stationary.
[Hid] appears in identifiability results, see for instance Gassiat et al. (2015) and Theorem 8. It is sufficient to ensure identifiability of the HMM from the law of three consecutive observations. Note that it is in general not possible to recover the law of a HMM from two observations (see for instance Appendix G of Anandkumar et al. (2012)), so that three is actually the minimum to obtain general identifiability.
3.2 The spectral method
Algorithm 1 is a variant of the spectral algorithm introduced in de Castro et al. (to appear). Unlike the original one, it is able to reach the minimax convergence rate thanks to two improvements. The first one consists in decomposing the joint density on different models, hence the use of two dimensions and . The second one consists in codiagonalizing the matrices for several random rotations instead of just one, and selecting the rotation that separates the singular values of the “best”, hence the parameter . These additional parameters do not actually add much to the complexity of the algorithm: in theory, the choice is fine (see Corollary 6), and in practice, any large enough constant works, see Section 4 for more details.
For all , let be an orthonormal basis of . Let
The following theorem follows the proof of Theorem 3.1 of de Castro et al. (to appear), with modifications that allow to control the error of the spectral estimators in expectation and are essential to obtain the right convergence rates in Corollary 6.
Assume [HX] and [Hid] hold. Then there exists a constant depending on and constants and depending on and such that for all , for all such that and for all , with probability greater than ,
Proof in Section 6.2.
Let us now apply the state-by-state selection method to these estimators. The following corollary shows that it is possible to reach the minimax convergence rate up to a logarithmic term separately for each state under standard assumptions. Note that we need to bound the resulting estimators by some power of , but this assumption is not very restrictive since can be arbitrarily large.
Assume [HX] and [Hid] hold. Also assume that for a constant and that for all , there exists such that . Then there exists a constant depending on and such that the following holds.
Let and . Let be the estimators selected from the family with , and for all . Then there exists a sequence of random permutations such that
The novelty of this result is that each emission density is estimated with its own convergence rate: the rate is different for each emission density, even though the original spectral estimators did not handle them separately. This is due to our state-by-state selection method.
Moreover, it is able to reach the minimax rate for each density in an adaptive way. For instance, in the case of a -Hölder density on (equipped with a trigonometric basis), one can easily check the control of , and the control follows from standard approximation results, see for instance DeVore and Lorentz (1993). Thus, our estimators converge with the rate to this density: this is the minimax rate up to a logarithmic factor.
By aligning the estimators like in Section 2.1, one can replace the sequence of permutations in Corollary 6 by a single permutation, in other words there exists a random permutation which does not depend on such that
This means that the sequence is an adaptive rate-minimax estimator of –or more precisely of one of the emission densities , but since the distribution of the HMM is invariant under relabelling of the hidden states, one can assume the limit to be without loss of generality–up to a logarithmic term.
At this point, it is important to note that the choice of the constant depends on the hidden parameters of the HMM and as such is unknown. This penalty calibration problem is very common in the model selection framework and can be solved in practice using methods such as the slope heuristics or the dimension jump method which have been proved to be theoretically valid in several cases, see for instance Baudry et al. (2012) and references therein. We use the dimension jump method and explain its principle and implementation in Section 4.2.
Proof Using Theorem 5, one gets that for all and for all such that , with probability ,
where with such that .
The condition on becomes
and is asymptotically true for all as soon as .
Thus, [H()] is true for the family . Note that the assumption also implies that there exists such that for large enough, Lemma 1 holds for all , so that Theorem 2 implies that for large enough, there exists a permutation such that with probability , for all ,
where the trade-off is reached for , which is in for large enough.
Finally, write the event of probability smaller than where [H()] doesn’t hold, then for large enough and for all ,
3.3 The penalized least squares method
Let be a subset of . We will need the following assumption on in order to control the deviations of the estimators:
, is closed under projection on for all and
with and larger than 1.
A simple way to construct such a set when is a finite measure is to take the sets spanned by the first vectors of an orthonormal basis whose first vector is proportional to . Then any set of densities such that , and for given constants and and for all satisfies [HF].
When , and , let
When is a probability distribution, a transition matrix and a -uple of probability densities, then is the density of the first three observations of a HMM with parameters . The motivation behind estimating is that it allows to recover the true parameters under the identifiability assumption [Hid], as shown in the following theorem.
Let be the set of transition matrices on and the set of probability distributions on . For a permutation , write its matrix (that is the matrix defined by ). Finally, define the distance on the HMM parameters
This distance is invariant under permutation of the hidden states. This corresponds to the fact that a HMM is only identifiable up to relabelling of its hidden states.
Theorem 8 (Identifiability)
Let such that for all and [Hid] holds. Then for all ,
The spectral algorithm of de Castro et al. (to appear) applied on the finite dimensional space spanned by the components of and allows to recover all the parameters even when the emission densities are not probability densities and when the Markov chain is not stationary.
Define the empirical contrast
where and are the observations. It is a biased estimator of the loss: for all ,
where . Since the bias does not depend on the function , one can hope that the minimizers of