LargeScale Shrinkage Estimation under Markovian Dependence
1 Introduction
Shrinkage is a useful notion that provides an elegant and powerful framework for compound estimation problems. The topic has been extensively studied since the celebrated work of James and Stein (1961). A plethora of influential results obtained by various researchers show that shrinkage often leads to substantial improvements in the performances of conventional estimators. The optimal directions and magnitudes of shrinkage estimators in parametric models have been extensively studied in the literature, see Fourdrinier et al. (2017); Johnstone (2013); Ahmed and Reid (2012); Carlin and Louis (2010) for reviews of related topics. In realworld applications, parametric families of estimators often have limited usage. Nonparametric shrinkage methods, exemplified by Tweedie’s formula (Efron, 2011), have received renewed attention in largescale inference problems where thousands of parameters are estimated simultaneously; see Brown and Greenshtein (2009); Jiang and Zhang (2009); Koenker and Mizera (2014); Saha and Guntuboyina (2017); Dicker and Zhao (2016) for some important recent developments.
One essential limitation of existing nonparametric shrinkage methods is that most assume that observations are independently sampled from an underlying distribution. However, observations arising from largescale estimation problems are often dependent. Ignoring the dependence structure may result in significant loss of efficiency and invalid inference. This article aims to develop nonparametric shrinkage estimators under the widely used hidden Markov model and show that the efficiency of existing nonparametric methods can be greatly improved by incorporating the Markovian dependence structure.
Consider the problem of simultaneous estimation of a sequence of dependent parameters that are generated from a Hidden Markov Model (HMM) (Rabiner, 1989; Churchill, 1992). Markovian dependence provides a powerful tool for modeling data arising from a wide range of modern scientific applications where parameters of interest are spatially or temporally correlated. This article focuses on an important class of HMMs that have two underlying states: one being the default (null) state where the process is incontrol; the other being the abnormal (nonnull) state where the process is outofcontrol. The observed data are independent conditional on the unknown states. The twostate HMM is a popular model that can be used to describe many data sets collected from various applications. For example, in estimating the Copy Number Variations (CNV) across a genome (Efron and Zhang, 2011; Jiang et al., 2015), there are regions with high and low variations that can be well described by the nonnull and null states, respectively. In event analysis application, such as estimating hourly social media trends of a marketing campaign there are dormant (null) and active (nonnull) states which can be adaptively incorporated into an HMM for building a better tracker.
To the best of our knowledge, the important problem of nonparametric shrinkage estimation under dependence has not been studied in the literature. The key idea in our methodology is to combine the ideas in the elegant Tweedie’s formula (Brown and Greenshtein, 2009; Jiang and Zhang, 2009; Efron, 2011) and the forwardbackward algorithms in HMMs to infer the underlying states, which is further utilized to facilitate fast and robust estimation of the unknown effect sizes (mean parameters) under Markovian dependence. We establish decisiontheoretic properties of the proposed estimator and exhibit its enhanced efficacy over popular shrinkage methods developed under the independence assumption. In contrast with existing algorithms in HMMs that proceed with prespecified families of parametric densities (e.g. Gaussian mixtures), our method is nonparametric and capable of handling a wider class of distributions. Through extensive numerical experiments, we demonstrate the superior performance of our proposed algorithm over existing stateoftheart methods.
The chapter is organized as follows: in Section 2 we formulate the problem, develop an oracle estimator and compare this estimator with classical shrinkage estimators that do not incorporate HMM structure. In Section 3, we propose a datadriven procedure that mimics the oracle estimator and discuss how to overcome the new difficulties and challenges in the nonparametric approach. In Section 4, we establish theoretical properties of our proposed estimator and show that the nonparametric approach offers substantial efficiency gain over HMM algorithms employing Gaussian mixture models. In Section 5, we present numerical experiments to compare our estimator with existing methods. Section 6 applies the proposed method to three real data examples. Section 7 concludes the article with a discussion on interesting extensions.
2 Shrinkage estimation in a hidden Markov model
Consider a twostate hidden Markov model. Suppose we are interested in estimating the mean vector based on observed output . In HMMs, the observed data can be viewed as a contaminated version of the effect sizes , where the contaminations are white noises following a zeromean normal distribution with known variance . The HMM further assumes that s are independently distributed conditional on the unknown states , where are Bernoulli variables forming a Markov chain. The latent state usually represents the null or the incontrol state and corresponds to the nonnull or the outofcontrol state. For example, in CNV studies, is the observed number of repeats in the genome, represents healthy part of genome, and represents the part of the genome that is susceptible to perturbation in diseased patients.
2.1 Model and notations
Formally, the data generation process can be described by a hierarchical model:
(1) 
where given , are conditionally independent following unknown prior distributions and :
In practice, we often assume that the null distribution is either known or can be modeled by a prespecified parametric family
We assume that s follow a spatially homogeneous Markov process with transition probabilities
The above probabilities are unknown and obey standard stochastic constraints Figure 1 presents a schematic diagram of the model. We aim to find a decision rule for estimating the unknown mean vector . The goal is to minimize the Bayes risk , where is the mean squared error (MSE, or loss):
Let denote the HMM parameters, where is the transition matrix, is the initial distribution, and
We use to denote the normal density with mean 0 and variance with the suffix dropped for standard normal density.
2.2 Oracle estimator under independence
Ignoring the dependence structure, the optimal solution that minimizes the Bayes risk is given by Tweedie’s formula:
(2) 
and is the marginal distribution of all the observed . The formula first appeared in Robbins (1955), who attributed the idea to Maurice Kenneth Tweedie. Tweedie’s formula can be implemented using empirical Bayes methods for constructing a class of nonparametric estimators (Brown and Greenshtein, 2009; Efron, 2011). The crucial observation is that it works directly with the marginal distribution, which is in particular attractive in largescale estimation problems since and can be well estimated using standard density estimation techniques (Silverman, 2018).
2.3 Oracle Estimator under HMM dependence
Our idea is to extend Tweedie’s formula to correlated observations under the HMM dependence. This can effectively increase the statistical efficiency by borrowing strength from adjacent observations. It has been shown in the multiple testing literature that exploiting the dependence structure can greatly improve the power of existing false discovery rate (Benjamini and Hochberg, 1995) methods (Sun and Cai, 2009; Wei et al., 2009; Sun et al., 2015). We further show in this article that dependence structure is also highly informative and can be exploited to improve the accuracy of shrinkage estimators.
Our methodological development is divided into two steps. We first assume that an oracle knows the hidden states and study the oracle rule. Then we discuss the case when the states are unknown and propose datadriven methods to emulate the oracle rule. If the states are known, then it is natural to apply Tweedie’s formula to two separate states:
(3) 
The oracle estimator (3) provides a benchmark in shrinkage estimation, i.e. the theoretically achievable lower limit of the estimation risk.
Next we consider a “weaker” oracle estimator that only knows , the collection of hyperparameters in the HMM. Then the optimal solution is given by the next lemma.
Lemma 1
Consider Model (1) and assume that HMM parameters are known. Then the estimator that minimizes the MSE is given by
(4)  
The proof of lemma is straightforward and hence omitted. The formula (4) can be viewed as an extension of Tweedie’s formula under HMM dependence. The next section considers the case where HMM parameters are unknown. We shall develop datadriven estimators and computational algorithms to emulate the Bayes estimator (4).
3 DataDriven Estimator and Computational Algorithms
We discuss how to estimate and to implement the Bayes estimator. The proposed estimator is a nonparametric Tweediebased shrinkage estimator under dependence (TD). In light of (4), we consider a class of estimators in the form:
(5) 
where are estimates of the true conditional probability , and and are estimates of and . Next we describe an algorithm for constructing estimates of the unknown quantities in (5).
3.1 The modified BaumWelch algorithm
The most wellknown method for constructing estimators of the form (5), under the conventional setting with prespecified parametric families, is the BaumWelch (BW) algorithm (Rabiner, 1989; Yang et al., 2015; Baum et al., 1970). We emphasize that the BW algorithm must proceed with user specified parametric forms for ’s. However, while , the incontrol distribution, can be well modeled by parametric densities, the outofcontrol observations may not be easily approximated by parametric densities. For example, the Gaussian mixture model with a fixed number of components is not suitable for approximating very heavy tailed . To overcome the issue, we describe an HMMbased Tweedie (HMMT) estimator that employs parametric and nonparametric . The essential component in our estimator is a generalized BW algorithm that updates the estimates of and iteratively based on posterior probability estimates of the latent states.
To avoid the identifiability issues, assume is unimodal and . These assumptions are usually reasonable in applications. In our HMMT estimator, is assumed to be a Gaussian density with possibly unknown location and scale . The Gaussian model can be generalized to Gaussian mixtures as well as other parametric families without essential difficulty. In contrast with existing HMM algorithms that utilize parametric assumptions on , we consider a nonparametric approach to estimating . The proposed methodology employs a class of weighted kernel density estimators
(6) 
where , is the bandwidth and is a standard Gaussian kernel function. Other choices of kernel such as the Cauchy kernel can also be used particularly in applications that are sensitive to the tails of the density.
With an initial choice of , and , the forwardbackward propagation steps of the BaumWelch algorithm provides updated estimates of the posterior probabilities . These estimates can be utilized to update using new weights and :
The process is repeated until convergence.
3.2 Choice of
Choosing an appropriate bandwidth is crucial for constructing an efficient estimator. If is too small, then we will end up with estimating by the naive estimator . If is too large, then it will be difficult to identify , which leads to a severely biased and inaccurate estimator. This section describe a cross validation method for choosing the tuning parameter .
The first step is to split the observed sample into by artifically adding independent Gaussian noise: , where . Note that are normally distributed random effects with same mean but different variances and . By construction, and are mutually independent conditioned on .
The key idea in the second step is to use for estimation and for bandwidth selection. One can think of as a measure of how much data we use for estimation (Brown et al., 2013). When , the entire data set is used for estimation and none for hyperparameter calibration. This article uses .
The steps are repeated for a list of prefixed values yielding estimates of the posterior probabilities of the latent states and marginal densities estimates and . For each value, we estimate the MSE as
(7) 
where, and . The optimal value of bandwidth is chosen as:
Subsequently, the HMMT estimator is computed by running the generalized BW algorithm with the bandwidth set at and .
3.3 Initialization and the HMMT estimator
Another important caveat about Algorithm 1 is the initialization step. To prevent the algorithm getting trapped in a local maximum, a good initialization is important. For this purpose we use the algorithm developed in Ko et al. (2015) for estimation in an HMM with unknown number of changing points. As the probability of the process being in control , is estimated by the mode of the component with the largest probability which is subsequently attributed to . The remaining states as estimated via a Dirichlet process model contributes towards the outofprocess marginal densities . It was found that such an initialization produced reasonable values of . Henceforth for simplicity, we assume . The detailed estimation procedure is summarized by Algorithm 1 below.
3.4 Operational characteristics of the new algorithm
This section discusses the operation characteristics of Algorithm 1 that distinguishes our work from existing ones. The discussions provide intuitions that are helpful for our later theoretical analysis.
Algorithm 1 includes several modifications to the conventional BaumWelch algorithm of Baum et al. (1970). For understanding the crucial differences first consider , i.e, no sample splitting. The average loglikelihood for complete data assuming known is:
When is unknown, interchanging logarithm and summation above, we iteratively maximize the following lower bound of ,
where, is the entropy, and decouples as with
and . We iteratively maximize over based on previous iterate . If the posterior probabilities for are updated to then, equals
Although is concave in , it is not strongly concave. Updating by maximizing over the dimensional hyperplane as would not yield good solutions. We interchange the logarithm and summation in the first term in above. The resultant have a closed form maxima at . as demonstrated in Algorithm 1 is accordingly updated. Standard forwardbackward procedure (Rabiner, 1989) is used to maximize and produce updated posterior probabilities . For and the forward variable: and the backward variable: are computed. and the posterior probabilities are updated as:
(8) 
The objective is bounded above and increasing at each step, therefore, our algorithm will converge. In particualr, as , the true solution is a fixed point of the algorithm. This insight is crucial for establishing the theoretical properties for the proposed HMMT estimator.
4 Asymptotic Properties of the proposed estimator
For establishing risk properties of our proposed shrinkage estimators, we consider the following assumption that is popularly imposed for theoretical analysis in HMM starting from the pioneering works of Bickel and Ritov (1996); Bickel et al. (1998):
Assumption A1. The hidden states form a stationary ergodic Markov chain and the transition probabilities satisfy . The Markov chain begins in its stationary state and is reversible.
To understand the risk properties of the proposed estimator , we consider the following two quantities:
(9) 
which are helpful for understanding the behaviors of the proposed estimate of . Note that is an estimator that cannot be implemented in practice, since it uses knowledge of unknown . By contrast, is a practical estimator that substitutes in place of . , which can be conceptualized as the estimates of posterior probabilities , can be obtained as the outputs from the proposed algorithm.
Consider defined by (8) at an arbitrary iterative step. Barring sample splitting, the estimate of in every iterative step of Algorithm 1 has the functional form of . We consider a class of that are smooth, bounded and Holder continuous.
Assumption A2. comes from a smooth class of functions and has bounded support. is Holder continuous, i.e., , where
It can be seen (cf. Ch. 6 of Wasserman, 2006) that the squared distance of the oracle kernel density estimator from the true ,
has the following asymptotic expected value:
The intuition is that, under assumption A1, can be decomposed as
where the residual is asymptotically negligible as provided are asymptotically unbiased estimates of . Theory underpinning this intuition is rigorously established in the proof of the next lemma. Hence
where is a fixed probability whose existence is ensured the stationarity of the Markov chain (Assumption A1). We summarize the above discussions in the lemma below.
Lemma 2
The above results shows that with good posterior probability estimates, the algorithm 1 provides consistent nonparametric estimation of . As is parametrized as a Gaussian density, its estimation consistency also subsequently follows.
The free parameters in are and , and . For understanding the evolution of the estimates in Algorithm 1 from the initial solutions towards the true , note that under Assumption A1, the population criteria and are welldefined. Suppose our initial solution is in the vicinity of such that with estimates and that are improvements over the initial estimates and in the sense and , maximizing and the subsequent application of (8) produces better estimates of the true posterior probabilities such that given by (9) is better than with . If this feature of the evolution is maintained over the successive iterations, the bias in the posterior probability estimates decrease in every step and and will ultimately converge to the true and respectively at the rate given in Lemma 2. Having a good initial solution that has negligible asymptotic bias, which implies that successive iterations would have the aforementioned properties, is hence essential for successful convergence in Algorithm 1. Our assumption on the properties related to initialization is given next.
Assumption A3. The initial solution , , is in the neighborhood of the true solution. The behaviour of the population loglikelihood in this neighborhood is such that for all ,
the posterior estimates based on produce estimates and in Algorithm 1 which satisfy:
The results in Lemma 2 establish the risk optimality of our proposed . To facilitate a universal optimality statement that is appropriate for both nonsparse and sparse regimes, we impose the following assumption on similar to Brown and Greenshtein (2009):
Assumption A4. For any fixed positive , we have as .
Remark 1
Based on our setup in Section 2.1 and Assumption A1, the above condition is equivalent to restricting the support of the priors and to where as for any positive .
For any fixed , let , and be the final estimates of the , and the mode of respectively, from algorithm 1. Recall that as , the estimates of the mean when the process is out of control is
Barring sample splitting (using , not steps 7 and 8 in algorithm 1) the HMMT estimator is . As the HMMT estimator involves ratio of estimates and , the approximation error bound in Lemma 2 on does not trivally lead to optimal risk properties of the HMMT estimator. For achieving asymptotically optimal risk performance, consider a modified robust version of the HMMT estimator that truncates very high outofcontrol mean values:
The following result akin to Theorem 1 of Brown and Greenshtein (2009) shows that is asymptotically oracle optimal as it can achieve the mean square error of the oracle shrinkage estimator defined in (3). The proofs of the results in this section are provided in the supplements.
Theorem 4.1
Under Assumptions A1 to A4, for any such that but for any positive , we have
5 Simulation Studies
This section conducts simulation studies to compare the performance of our proposed HMMT with the following four estimators:

(a) GMM.3: we use BW algorithm and Gaussian mixture models with 3 mixing components for modeling ;

(b) GMM.DP: similar to GMM.3, we use BW algorithm and Gaussian mixture models. However, the number of components in the Gaussian mixture will not be fixed as before. Instead, the number of components is estimated using the Dirichlet Process model as described in Ko et al. (2015);

(c) T.ND: we ignore the Markovian dependence structure and apply Tweedie’s formula. The algorithm in Fu,Luella et al. (2017) has been used to choose the tuning parameter.

(d) OR: the oracle estimator in (3) which uses knowledge about the latent parameters.
5.1 Comparison of the MSEs
In Table 1 we report the mean squared errors averaged over 50 replications. We consider 11 simulation scenarios. In each scenario, we simulate observations. The latent states are generated based on a twostates Markov chain where the transition matrix has the probability of being incontrol fixed at while the probability of being outofcontrol is varied between to . Across all the scenarios was fixed as the distribution with point mass at . In cases I to VI, the outofcontrol prior was generated from the following 6 densities:

(a) uniform distribution with support on ;

(b) Asymmetric triangle distribution on with mode at ;

(c) Levy distribution with scale on ;

(d) Noncentral Chisquare distribution with degrees of freedom and noncentrality parameter at ;

(e) Weibull distribution with shape and scale ;

(f) Burr distribution with location , scale and shape parameters and . In cases VII to XI, we study the performance of the estimators when is generated from mixture of the above densities.
The following observations can be made based on the simulation results.

Across all the regimes, our proposed HMMT estimator significantly improves the nonparametric shrinkage estimator T.ND which does not use dependent structure, which shows the importance of taking the dependence structure into account.

In many cases, incorporating Markov structure result in efficiency gain even when the model is misspecified.

The HMMT estimator also improves on the Gaussian mixture model based estimation strategies and has error rates quite close to that of the Oracle estimator in (3). GMM.3 sometimes has higher MSE than even the nonparametric shrinkage estimator T.ND that does not use dependent structure. This reflects the usefulness of using Kernel density based nonparametric approach in HMMT.
5.2 Comparison of the estimated ’s
When implementing the proposed HMMT method, the estimate GMM.DP has been used in the initialization step. This indicates that, in the cases where HMMT improves significantly over GMM.DP, the proposed algorithm employs that is sufficiently far away from the Gaussian mixture family. This would typically happen when the distribution of outofcontrol averages is heavytailed. In Figure 2, the estimated across these simulation regimes are displayed for the case . Across all the studied regimes, the HMMT estimator is evidently smoother and its shape is closer to the truth.
Scenarios  Outofcontrol prior  T.ND  GMM.3  GMM.DP  HMMT  Oracle  

I  Unif[9,9]  0.2  0.345  0.213  0.137  0.130  0.112 
0.4  0.361  0.217  0.150  0.138  0.131  
0.6  0.439  0.317  0.190  0.177  0.168  
0.8  0.429  0.502  0.263  0.253  0.238  
II  Triangle  0.2  0.378  0.224  0.134  0.124  0.110 
0.4  0.294  0.242  0.154  0.136  0.122  
0.6  0.319  0.336  0.190  0.176  0.164  
0.8  0.515  0.579  0.284  0.258  0.250  
III  Levy  0.2  0.313  0.175  0.128  0.120  0.116 
0.4  0.309  0.192  0.144  0.135  0.132  
0.6  0.356  0.215  0.171  0.163  0.160  
0.8  0.468  0.375  0.240  0.235  0.232  
IV  Noncentral  0.2  0.358  0.195  0.123  0.117  0.110 
0.4  0.330  0.199  0.151  0.149  0.141  
0.6  0.340  0.246  0.171  0.167  0.159  
0.8  0.521  0.392  0.232  0.225  0.223  
V  Weibull  0.2  0.352  0.174  0.134  0.130  0.124 
0.4  0.370  0.184  0.149  0.143  0.136  
0.6  0.402  0.195  0.165  0.162  0.160  
0.8  0.447  0.289  0.239  0.233  0.229  
VI  Burr  0.2  0.329  0.191  0.128  0.122  0.116 
0.4  0.365  0.200  0.152  0.150  0.142  
0.6  0.428  0.247  0.176  0.174  0.167  
0.8  0.421  0.377  0.233  0.228  0.217  
VII  0.4 Unif[3,8] + 0.6 Levy  0.2  0.363  0.180  0.121  0.118  0.111 
0.4  0.394  0.173  0.128  0.127  0.121  
0.6  0.389  0.248  0.169  0.166  0.165  
0.8  0.506  0.367  0.230  0.223  0.221  
VIII  0.4 Noncentral + 0.6 Triangle  0.2  0.340  0.253  0.109  0.106  0.102 
0.4  0.428  0.268  0.139  0.132  0.124  
0.6  0.435  0.333  0.176  0.166  0.160  
0.8  0.541  0.591  0.239  0.228  0.226  
IX  0.4 Unif[3,8] + 0.6 Weibull  0.2  0.262  0.153  0.132  0.123  0.112 
0.4  0.345  0.163  0.136  0.135  0.126  
0.6  0.395  0.175  0.156  0.153  0.146  
0.8  0.437  0.257  0.219  0.217  0.210  
X  0.5 Weibull + 0.5 Levy  0.2  0.301  0.177  0.123  0.119  0.106 
0.4  0.321  0.196  0.142  0.138  0.129  
0.6  0.362  0.226  0.170  0.165  0.156  
0.8  0.462  0.375  0.240  0.238  0.231  
XI  0.5 Noncentral + 0.5 Burr  0.2  0.295  0.176  0.122  0.120  0.112 
0.4  0.329  0.205  0.150  0.147  0.138  
0.6  0.370  0.276  0.178  0.174  0.166  
0.8  0.456  0.404  0.244  0.242  0.235 
6 Real data analysis
This section we illustrate our methodology by applying it to several real data analyses.
6.1 Copy number variation
We analyze the IMR103 data in Sharp et al. (2006), which are displayed in Figure 3. The gene copy number is the number of copies of a particular gene in the genotype of an individual. It is widely believed that in healthy cells, the average copy number should be 2. A shift away from 2 is a genomic disorder and is usually related to certain disease. It is clear from the left panel of Figure 3 that in the region from 100000 to 110000, there is a shift away from 0 in ratio. We take and model the data using an HMM. The two hidden states 0 and 1 can be interpreted as healthy/unhealthy genes. The noise variance is estimated as the sample variance of the the first 10000 data (0.183). The right panel of Figure 3 shows the histogram of the first 10000 data along with the density function of . We will take this as .
For the data analysis, we focus on the data from position position 111000 to 112999 (Call it ) and from position 1113000 to 117999 (Call it ). For each data in we add a random noise . Define , For each data in we also add a noise from the same distribution . Let and . Under this construction, and are independent with the same mean. We will construct the estimators based on , and estimate the mean vector of (call it ). And use the average of as prediction error. The plot of and are shown in figure 14 and figure 15 respectively.
The estimated using the Gaussian mixture model and our nonparametric method are displayed in Figure 5. Use the sample splitting method we discussed in Section 3.2, the bandwidth is chosen to be 0.28. Finally we apply the GMM.DP and HMMT methods to the data set. The prediction error for GMM.DP is 0.130, whereas the prediction error of the proposed HMMT is 0.121. This illustrates the benefit of using the nonparametric approach to modeling .
6.2 Internet search trend
This section applies the proposed methods for analysis of search trend data. The key word of interest is “NBA”. The dat