Asymptotically optimal empirical Bayes inference in a piecewise constant sequence model
Inference on high-dimensional parameters in structured linear models is an important statistical problem. In this paper, for the piecewise constant Gaussian sequence model, we develop a new empirical Bayes solution that enjoys adaptive minimax posterior concentration rates and, thanks to the conjugate form of the empirical prior, relatively simple posterior computations.
Keywords and phrases: Adaptive estimation; change-point; high-dimensional; minimax; posterior concentration rate.
Consider a Gaussian sequence model
where are independent, the variance is known, and inference on the unknown mean vector is desired. It is common to assume that is sparse, i.e., most ’s are zero, and work on these problems goes back at least to Donoho and Johnstone (1994), and more recently in Johnstone and Silverman (2004), Jiang and Zhang (2009), Castillo and van der Vaart (2012), Martin and Walker (2014), and van der Pas et al. (2017).
There has also been recent interest in the piecewise constant structure, in which there is a simple partition of the set into consecutive blocks , , where
This structure is parametrized by , where is the blocking configuration and is the blocking configuration-specific mean parameters. This piecewise constant sequence model is the canonical example in the change-point literature, e.g., Frick et al. (2014) and Fryzlewicz (2014). The former paper describes applications of this piecewise constant sequence model in DNA copy number profiling and photo emission spectroscopy, and the latter in finance.
Our goal in this paper is to develop a Bayesian or, more precisely, empirical Bayesian approach for posterior inference on in the piecewise constant sequence model. Our proposed empirical prior makes use of genuine prior information that says the vector is not too complex, i.e., that the block configuration has size relatively small, but is data-dependent, or non-informative, on the actual values corresponding to the block configuration. The data is incorporated in the prior in two ways: the first is via a prior centering and the second is a mild regularization. Our theoretical results demonstrate that the corresponding empirical Bayes posterior distribution enjoys adaptive concentration at the minimax rate, recently worked out in Gao et al. (2017), even adjusting to phase transitions. To our knowledge, this is the first adaptive minimax posterior concentration rate result for this model. Moreover, since the priors for are conjugate, the posterior is relatively easy to compute, and numerical illustrations show that the empirical Bayes estimator is quite accurate even in finite samples.
2 Empirical Bayes model
In light of the representation of the mean vector in terms of a block configuration and block-specific parameters, a hierarchical prior is natural. That is, first specify a prior for , and then a conditional prior for , given . Here we will follow this strategy, but with a twist where the latter component depends on data in a certain way, making it an empirical prior.
The piecewise constant structure comes with an assumption that there are not too many blocks, i.e., that is relatively small compared to . This can be incorporated into the prior for in the following way. Set , and introduce a marginal prior
where is a constant to be specified. Note that this is effectively a truncated geometric distribution with parameter , which puts most of its mass on small values of the block configuration size, hence incorporating the prior information that is not too complex. Next, if the configuration size is given, the blocks correspond to a simple partition of into consecutive chunks, and there are such partitions. So, for the conditional prior distribution of , given , we can simply take a discrete uniform distribution. Therefore, the prior distribution for is given by
where ranges over all simple partitions of into consecutive blocks.
It remains to propose a conditional prior for , given . Here we take a prior which assigns independent normal distributions to each , , but we let the data inform the prior location. According to Martin and Walker (2014, 2017) and Martin et al. (2017), this data-driven prior centering reduces the potentially-problematic influence of the prior tails on the asymptotic behavior of the posterior. In this case, define the -specific maximum likelihood estimator , where is the average of . Then we center the normal prior on , i.e., the conditional prior for , given , is given by
where the variance and is a constant to be specified. Denote the density function, with respect to Lebesgue measure on , of this prior distribution for as .
As usual, the prior distribution for the mean vector under this hierarchical formulation is the finite mixture of with respect to the prior distribution for the configuration . That mixture prior distribution will be denoted as .
Let denote the likelihood function based on the model (1). For the posterior, we combine the prior with a power-likelihood , for , via Bayes’s formula:
Such posteriors have received some attention recently, e.g., Grünwald and van Ommen (2017), Miller and Dunson (2015), Martin and Walker (2017), Holmes and Walker (2017), and Bhattacharya et al. (2017). For an empirical Bayes perspective on this, consider rewriting the above as
This expression suggests interpreting as the posterior obtained by Bayes’s formula applied to the likelihood and a modified empirical prior .
The primary motivation for the posterior defined above is the theoretical concentration rate properties it achieves; see Section 3. The power also makes for simple and elegant proofs. We expect that the same properties would hold for the more traditional model, but this cannot be established using the arguments here. A likely more difficult analysis is needed to get optimal results for the case; compare, e.g., the proof of Theorem 1 in Martin et al. (2017) to that of Theorem 1 in Belitser and Ghosal (2017). But the rates presented in Theorem 1 below are optimal—in particular, the rates with can be no better—and, since can be arbitrarily close to 1, there is no practical difference between the and cases. Therefore, we see no obvious reason to prefer the former case over the latter.
3 Posterior concentration rates
Consider a true piecewise constant mean vector having block configuration ; here and throughout, will denote the block configuration of a vector . Define the target rate
Note that, in the case , the model is iid, and the best estimator of the constant vector would be the -vector with each entry equal to , the sample mean, and its risk is constant, as in (5). More generally, according to Corollary 3.1 in Gao et al. (2017), the rate (5) is close to the minimax optimal; see Remark 1 below. Theorem 1 says that attains the target rate in (5). Proofs are deferred to the appendix.
with the -norm on . If is such that , the above convergence property holds with replaced by a sufficiently large constant .
The prior described above does not have knowledge of the block configuration size , hence the rate is adaptive to the unknown complexity level. To our knowledge, this is the first result on adaptive posterior concentration for the piecewise constant sequence model.
The target rate (5) includes a phase transition, but the optimal rate presented in Gao et al. (2017) actually has two. Their target rate has for all with . Following the proof of Theorem 1, accommodating this phase transition would require for some constant . But is a probability mass function, so it cannot blow up like this. Since the concentration rate proofs for other posterior distributions would follow along similar lines, we expect that our inability to handle this additional phase transition is a general phenomenon for Bayesian approaches to this problem.
Next, the posterior mean is an adaptive, asymptotically minimax estimator.
Under the setup in Theorem 1, .
In addition to recovery rate results, the complexity of the posterior itself is interesting. Theorem 3 says that the effective dimension of the posterior is no larger than a multiple of the true block configuration size, i.e., the posterior is of roughly the correct complexity.
There exists a constant such that
4 Numerical results
Genuine Bayesian solutions to high-dimensional problems, ones for which optimal posterior rates are available, tend to be based on non-conjugate priors, making computation non-trivial. Our empirical Bayes solution, on the other hand, is based on a conjugate prior, making computations relatively simple. Indeed, the marginal posterior for is available in closed-form,
Sampling from via Metropolis–Hastings is easy, as is sampling from the conditional posterior of , given . R code is available at http://www4.stat.ncsu.edu/~rmartin.
First is a standard test example described in Frick et al. (2014), p. 561; see, also, Fryzlewicz (2014), Appendix B(2). This one has mean parameters, but only unique values. This true signal is depicted by the thin black line in Figure 1(a). Data is sampled from a normal distribution, centered at the true signal, with standard deviation . The empirical Bayes model described above is fit to these data, using and , and the posterior mean based on 10,000 Monte Carlo samples from is depicted by the black dots in the figure. Clearly, the posterior mean does an excellent job estimating true signal. In this case, the posterior distribution for , not shown, put almost all of its mass on the true value .
Next, we follow Fan and Guan (2017), Section 6.3, and consider a mean vector of size with unique mean parameters, equally spaced. The specific mean values are randomly sampled from a uniform distribution on . Data are simulated from a normal distribution centered at the signal with . Figure 1(b) summarizes the output of the empirical Bayes posterior, again based on and . Again, we see that the posterior mean does a good job recovering the true signal; also, the posterior distribution for , again not shown, puts almost all its mass on values 17–19, just below the true value .
An interesting possible extension of the work here is in Fan and Guan (2017). They investigate a version with a graph and, at each vertex , there is a response , but only a small number of edges have . The only obstacle preventing us from extending our analysis to this more general graph setting is the need to assign a prior distribution for the block structure in this more complex setting. But given such a prior, the theoretical results described here would carry over directly.
Another interesting and practically important extension is to the case where the mean vector is monotone. In such a case, our proposed prior for the block configuration is fine, but a prior for that respects the monotonicity is not so straightforward. A natural idea is to project the prior presented in Section 2.1 onto the cone of monotone sequences, but this projection forces positive prior mass on the boundary of the cone which introduces some additional technicalities that affect the convergence rate proofs. We hope to present results on this elsewhere.
This work is partially supported by the National Science Foundation, DMS–1737933, and the Simons Foundation Award 512620.
A.0. Preliminary results
For our theoretical analysis, it will help to rewrite the posterior distribution as the ratio , where the numerator and denominator are
is the likelihood ratio, consists of all -vectors corresponding to block configuration , and is the -vector that satisfies for , with . Since is fixed in our calculations, we will abbreviate by .
There exists such that for all large .
Define the set . Then
Under the empirical prior for , has a chi-square distribution with degrees of freedom, and the event is precisely . Using standard bounds on the chi-square distribution function, and Stirling’s approximation of the gamma function, the claim follows, with . ∎
Next, for the numerator, we consider a particular sequence of subsets, namely,
where is a sufficiently large constant, and is the target rate.
Take such that . Then , for all large , where
Towards an upper bound, we interchange expectation with the finite sum over and the integral over , the latter step justified by Tonelli’s theorem, so that
Next, we work with each of the -dependent integrands separately. Take an arbitrary and set to be its Hölder conjugate. Then Hölder’s inequality gives
On the set , since , the first term above is uniformly bounded by . To see this, note that, for a general , if denotes the joint density of under (1), and the Rényi -divergence of one normal distribution from another (e.g., van Erven and Harremoës 2014, p. 3800), then
For the second term, we show that it simplifies to a suitable constant times a normal density function in . Using the fact that is normally distributed, a simple-but-tedious moment generating function calculation gives
and the latter product is a normal density in . Integrating the upper bound with respect to gives
where . Using the formula (2) for and standard properties of a geometric series, it is easy to see that, for all large , the summation term in the above upper bound is uniformly bounded in , proving the claim. ∎
A.1. Proof of Theorem 1
Plug in the bound from Lemma 2, with to get
If , then both and the ratio in the above display are constant. Therefore, the upper bound vanishes if . On the other hand, if , then is diverging. Also, using the formula for in (2) and the standard bound, , on the binomial coefficient, we get
Since is roughly , it follows that, for fixed sufficiently large, the upper bound above vanishes as , as was to be proved.
A.2. Proof of Theorem 2
Start with . Write as , where is as defined above. Then
That remaining integral can be expressed as a ratio of numerator to denominator, where the denominator is just as in Lemma 1 and the numerator is
Take expectation of the numerator to the inside of the integral and apply Hölder’s inequality just like in the proof of Lemma 2. This gives the following upper bound on each -specific integral:
where is a constant that depends only on , , and the Hölder constant . Since the function is eventually monotone decreasing, for sufficiently large we get a trivial upper bound on the above display, i.e.,
The same argument as above bounds the remaining integral by , and the prior takes care of its contribution. Making large enough in will also take care of the bound on from Lemma 1. Therefore, the second term on the right-hand side of (7) is also bounded by a multiple of . Finally, Jensen’s inequality gives , and the claim follows.
A.3. Proof of Theorem 3
The same Hölder’s inequality approach leads to an upper bound like . Then
From (2), factor out a common from the summation, which will be the dominant term. Indeed, like in the proof of Theorem 1, the ratio in the above display is of order . Then the right-hand side above is of order . For sufficiently large , the negative term dominates, so the upper bound vanishes, proving the claim.
- Abramovich et al. (2006) Abramovich, F., Benjamini, Y., Donoho, D. L., and Johnstone, I. M. (2006). Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist., 34(2):584–653.
- Belitser and Ghosal (2017) Belitser, E. and Ghosal, S. (2017). Empirical Bayes oracle uncertainty quantification. Unpublished manuscript, http://www4.stat.ncsu.edu/~ghoshal/papers/oracle_regression.pdf.
- Bhattacharya et al. (2017) Bhattacharya, A., Pati, D., and Yang, Y. (2017). Bayesian fractional posteriors. Unpublished manuscript, arXiv:1611.01125.
- Castillo and van der Vaart (2012) Castillo, I. and van der Vaart, A. (2012). Needles and straw in a haystack: posterior concentration for possibly sparse sequences. Ann. Statist., 40(4):2069–2101.
- Donoho and Johnstone (1994) Donoho, D. L. and Johnstone, I. M. (1994). Minimax risk over -balls for -error. Probab. Theory Related Fields, 99(2):277–303.
- Fan and Guan (2017) Fan, Z. and Guan, L. (2017). Approximate -penalized estimation of piecewise constant signals on graphs. Ann. Statist., to appear; arXiv:1703.01421.
- Frick et al. (2014) Frick, K., Munk, A., and Sieling, H. (2014). Multiscale change point inference. J. R. Stat. Soc. Ser. B. Stat. Methodol., 76(3):495–580. With 32 discussions by 47 authors and a rejoinder by the authors.
- Fryzlewicz (2014) Fryzlewicz, P. (2014). Wild binary segmentation for multiple change-point detection. Ann. Statist., 42(6):2243–2281.
- Gao et al. (2017) Gao, C., Han, F., and Zhang, C.-H. (2017). Minimax risk bounds for piecewise constant models. Unpublished manuscript, arXiv:1705.06386.
- Grünwald and van Ommen (2017) Grünwald, P. and van Ommen, T. (2017). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Anal., to appear; arXiv:1412.3730.
- Holmes and Walker (2017) Holmes, C. C. and Walker, S. G. (2017). Assigning a value to a power likelihood in a general Bayesian model. Biometrika, 104(2):497–503.
- Jiang and Zhang (2009) Jiang, W. and Zhang, C.-H. (2009). General maximum likelihood empirical Bayes estimation of normal means. Ann. Statist., 37(4):1647–1684.
- Johnstone and Silverman (2004) Johnstone, I. M. and Silverman, B. W. (2004). Needles and straw in haystacks: empirical Bayes estimates of possibly sparse sequences. Ann. Statist., 32(4):1594–1649.
- Martin (2017) Martin, R. (2017). Empirical priors and posterior concentration rates for a monotone density. Unpublished manuscript, arXiv:1706.08567.
- Martin et al. (2017) Martin, R., Mess, R., and Walker, S. G. (2017). Empirical Bayes posterior concentration in sparse high-dimensional linear models. Bernoulli, 23(3):1822–1847.
- Martin and Walker (2014) Martin, R. and Walker, S. G. (2014). Asymptotically minimax empirical Bayes estimation of a sparse normal mean vector. Electron. J. Stat., 8(2):2188–2206.
- Martin and Walker (2017) Martin, R. and Walker, S. G. (2017). Empirical priors for target posterior concentration rates. Unpublished manuscript, arXiv:1604.05734.
- Miller and Dunson (2015) Miller, J. W. and Dunson, D. B. (2015). Robust Bayesian inference via coarsening. Unpublished manuscript, arXiv:1506.06101.
- van der Pas et al. (2017) van der Pas, S., Szabó, B., and van der Vaart, A. (2017). Adaptive posterior contraction rates for the horseshoe. Electron. J. Stat., 11(2):3196–3225.
- van Erven and Harremoës (2014) van Erven, T. and Harremoës, P. (2014). Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inform. Theory, 60(7):3797–3820.