
Bayesian Predictive Densities Based on Latent Information Priors

Fumiyasu Komaki

Department of Mathematical Informatics

Graduate School of Information Science and Technology, the University of Tokyo

7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, JAPAN

Summary

Construction methods for prior densities are investigated from a predictive viewpoint. Predictive densities for future observables are constructed by using observed data. The simultaneous distribution of future observables and observed data is assumed to belong to a parametric submodel of a multinomial model. Future observables and data are possibly dependent. The discrepancy of a predictive density from the true conditional density of future observables given observed data is evaluated by the Kullback-Leibler divergence. It is proved that limits of Bayesian predictive densities form an essentially complete class. Latent information priors are defined as priors maximizing the conditional mutual information between the parameter and the future observables given the observed data. Minimax predictive densities are constructed as limits of Bayesian predictive densities based on prior sequences converging to the latent information priors.

AMS 2010 subject classifications: 62F15, 62C07, 62C20.

Keywords: essentially complete class, Jeffreys prior, Kullback-Leibler divergence, minimaxity, multinomial model, reference prior.

## 1.  Introduction

We construct predictive densities for future observables by using observed data. Future observables and data are possibly dependent, and their simultaneous distribution is assumed to belong to a submodel of a multinomial model. Various practically important models, such as categorical models and graphical models, are included in this class.

Let $\mathcal{X}$ and $\mathcal{Y}$ be finite sets composed of $|\mathcal{X}|$ and $|\mathcal{Y}|$ elements, and let $x$ and $y$ be random variables that take values in $\mathcal{X}$ and $\mathcal{Y}$, respectively. Let $\{p(x,y \mid \theta) : \theta \in \Theta\}$ be a set of probability densities on $\mathcal{X} \times \mathcal{Y}$. The model is regarded as a submodel of the $|\mathcal{X}||\mathcal{Y}|$-nomial model with trial number 1. Here, we do not lose generality by assuming the trial number is 1. The model is naturally regarded as a subset of the hyperplane $\{(p(x,y))_{x,y} : \sum_{x,y} p(x,y) = 1\}$ in the Euclidean space $\mathbb{R}^{|\mathcal{X}||\mathcal{Y}|}$. In the following, we identify $\theta$ with $p(\cdot,\cdot \mid \theta)$. Then, the parameter space $\Theta$ is endowed with the induced topology as a subset of $\mathbb{R}^{|\mathcal{X}||\mathcal{Y}|}$.

A predictive density $q(y;x)$ is defined as a function from $\mathcal{Y} \times \mathcal{X}$ to $[0,1]$ satisfying $\sum_{y} q(y;x) = 1$ for every $x \in \mathcal{X}$. The closeness of $q(y;x)$ to the true conditional probability density $p(y \mid x, \theta)$ is evaluated by the average Kullback-Leibler divergence:

$$ R(\theta, q) = \sum_{x,y} p(x,y \mid \theta) \log \frac{p(y \mid x, \theta)}{q(y;x)}, \qquad (1) $$

where we define $p(x \mid \theta) := \sum_{y} p(x,y \mid \theta)$, $p(y \mid x, \theta) := p(x,y \mid \theta)/p(x \mid \theta)$, and $0 \log 0 := 0$. Although the conditional probability $p(y \mid x, \theta)$ is not uniquely defined when $p(x \mid \theta) = 0$, the risk value is uniquely determined because $p(x,y \mid \theta) = 0$ if $p(x \mid \theta) = 0$.
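To fix ideas, the risk (1) can be computed numerically for a toy submodel. The model $p(x,y \mid \theta) = \theta^{x+y}(1-\theta)^{2-x-y}$ ($x$ and $y$ i.i.d. Bernoulli trials) and all function names below are illustrative assumptions, not constructions from the paper; a minimal sketch:

```python
import numpy as np

def joint(theta):
    # Illustrative submodel (an assumption, not the paper's general setting):
    # x, y i.i.d. Bernoulli(theta), p(x, y | theta) = theta^(x+y) (1-theta)^(2-x-y).
    return np.array([[(1 - theta) ** 2, (1 - theta) * theta],
                     [theta * (1 - theta), theta ** 2]])

def risk(theta, q):
    # Average Kullback-Leibler risk R(theta, q) of (1); q[y, x] is a
    # predictive density with sum_y q[y, x] = 1, and 0 log 0 := 0.
    p = joint(theta)
    px = p.sum(axis=1)  # p(x | theta)
    r = 0.0
    for x in range(2):
        for y in range(2):
            if p[x, y] > 0:  # terms with p(x, y | theta) = 0 contribute nothing
                r += p[x, y] * np.log(p[x, y] / (px[x] * q[y, x]))
    return r

theta = 0.3
q_true = np.array([[1 - theta, 1 - theta], [theta, theta]])  # the true p(y | x, theta)
q_unif = np.full((2, 2), 0.5)
print(risk(theta, q_true), risk(theta, q_unif))  # ~0 and a positive value
```

The true conditional density has risk zero, while any other predictive density pays a positive Kullback-Leibler price at this $\theta$.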

First, we show that, for every predictive density $q(y;x)$, there exists a limit of Bayesian predictive densities

$$ p_{\pi_n}(y \mid x) := \frac{\int p(x,y \mid \theta)\, \mathrm{d}\pi_n(\theta)}{\int p(x \mid \theta)\, \mathrm{d}\pi_n(\theta)}, $$

where $\{\pi_n\}$ is a prior sequence, such that $R(\theta, \lim_{n\to\infty} p_{\pi_n}) \le R(\theta, q)$ for every $\theta \in \Theta$. In the terminology of statistical decision theory, this means that the class of predictive densities that are limits of Bayesian predictive densities is an essentially complete class.
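For a prior supported on finitely many points, the Bayesian predictive density above reduces to a ratio of weighted sums. A sketch under the same illustrative Bernoulli-type submodel (an assumption, as are the function names):

```python
import numpy as np

def joint(theta):
    # Illustrative submodel (assumption): x, y i.i.d. Bernoulli(theta).
    return np.array([[(1 - theta) ** 2, (1 - theta) * theta],
                     [theta * (1 - theta), theta ** 2]])

def bayes_predictive(thetas, weights):
    # p_pi(y | x) = (integral of p(x, y | theta) dpi) / (integral of p(x | theta) dpi)
    # for a prior pi putting mass weights[i] on thetas[i]; requires p_pi(x) > 0.
    pxy = sum(w * joint(t) for t, w in zip(thetas, weights))  # p_pi(x, y)
    px = pxy.sum(axis=1, keepdims=True)                       # p_pi(x)
    return (pxy / px).T                                       # entry [y, x]

pred = bayes_predictive([0.25, 0.75], [0.5, 0.5])
print(pred)  # each column (fixed x) sums to one
```

The requirement $p_\pi(x) > 0$ for every $x$ is exactly the point where prior sequences, rather than single priors, become necessary later in the paper.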

Next, we investigate latent information priors, defined as priors maximizing the conditional mutual information between $\theta$ and $y$ given $x$. We obtain a method for constructing a prior sequence converging to a latent information prior, based on which a minimax predictive density is obtained. We consider limits of Bayesian predictive densities to deal with conditional probabilities.
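The conditional mutual information between $\theta$ and $y$ given $x$ can be written as the prior average of the Kullback-Leibler divergence of $p_\pi(y \mid x)$ from $p(y \mid x, \theta)$; a latent information prior maximizes this quantity over $\pi$. A sketch for a finitely supported prior under the illustrative Bernoulli-type submodel (both assumptions, as are the names):

```python
import numpy as np

def joint(theta):
    # Illustrative submodel (assumption): x, y i.i.d. Bernoulli(theta),
    # with theta kept inside (0, 1) so that all marginals are positive.
    return np.array([[(1 - theta) ** 2, (1 - theta) * theta],
                     [theta * (1 - theta), theta ** 2]])

def cond_mutual_info(thetas, weights):
    # I(theta; y | x) under a prior pi on the grid thetas: the pi-average
    # Kullback-Leibler divergence of p_pi(y | x) from p(y | x, theta),
    # i.e. the quantity a latent information prior maximizes over pi.
    pj = np.array([joint(t) for t in thetas])        # [i, x, y]
    pxy = np.tensordot(weights, pj, axes=1)          # p_pi(x, y)
    cond = pxy / pxy.sum(axis=1, keepdims=True)      # p_pi(y | x)
    info = 0.0
    for i, w in enumerate(weights):
        p = pj[i]
        px = p.sum(axis=1, keepdims=True)            # p(x | theta_i)
        info += w * np.sum(p * np.log((p / px) / cond))
    return info

# A point-mass prior carries no latent information; a spread-out prior does.
print(cond_mutual_info([0.3], [1.0]), cond_mutual_info([0.25, 0.75], [0.5, 0.5]))
```

Since this quantity is also the average risk of the Bayesian predictive density $p_\pi$ under $\pi$, maximizing it is the natural route toward minimaxity.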

There exist important previous studies on prior construction by using the unconditional mutual information. The reference prior of Bernardo (1979, 2005) is a prior maximizing the mutual information between the parameter and the observations in the limit as the amount of information in the observations goes to infinity. It corresponds to the Jeffreys prior if there are no nuisance parameters; see Ibragimov and Hasminskii (1973) and Clarke and Barron (1994) for rigorous treatments. In coding theory, the prior maximizing the mutual information between $\theta$ and $y$ is used for Bayes coding. Bayes codes for finite-alphabet models based on such priors were shown to be minimax by Gallager (1979) and Davisson and Leon-Garcia (1980). In our framework, these settings correspond to prediction of $y$ without $x$. In statistical applications, $x$ plays an important role because it corresponds to observed data, whereas $x$ is an empty set in reference analysis and in the standard framework of information theory; see also Komaki (2004) for the relation between statistical prediction and Bayes coding.

Geisser (1978), in the discussion of Bernardo (1978), considered minimax prediction based on the risk function (1) as an alternative to the reference prior approach.

The latent information priors introduced in the present paper bridge these two approaches. The theorems obtained below clarify the relation between the conditional mutual information and minimax prediction based on observed data.

For Bayesian prediction of future observables by using observed data, Akaike (1983) discussed priors maximizing a mutual information quantity and called them minimum information priors. Kuboki (1998) also proposed priors for Bayesian prediction based on an information-theoretic quantity. These priors are different from the latent information priors investigated in the present paper.

In section 2, we prove that, for every predictive density $q$, there exists a predictive density that is a limit of Bayesian predictive densities and whose performance is not worse than that of $q$. In section 3, we introduce a construction method for minimax predictive densities as limits of Bayesian predictive densities. The method is based on the conditional mutual information between $\theta$ and $y$ given $x$. In section 4, we give some numerical results and discussions.

## 2.  Limits of Bayesian predictive densities

In this section, we prove that the class of predictive densities that are limits of Bayesian predictive densities is an essentially complete class.

Throughout this paper, we assume the following conditions:

Assumption 1. $\Theta$ is compact.

Assumption 2. For every $x \in \mathcal{X}$, there exists $\theta \in \Theta$ such that $p(x \mid \theta) > 0$.

These assumptions are not restrictive. For Assumption 1, if $\Theta$ is not compact, we can regard the closure $\overline{\Theta}$ as the parameter space instead of $\Theta$, because we consider a submodel of a multinomial model. We do not lose generality by Assumption 2, because we can adopt $\mathcal{X} \setminus \{x\}$ instead of $\mathcal{X}$ if there exists $x \in \mathcal{X}$ such that $p(x \mid \theta) = 0$ for every $\theta \in \Theta$.

We prepare several preliminary results to prove Theorem 1 below.

Let $\mathcal{P}$ be the set of all probability measures on $\Theta$, endowed with the weak convergence topology and the corresponding Borel algebra. By the Prohorov theorem and Assumption 1, $\mathcal{P}$ is compact.

When $x$ and $y$ are fixed, the function $\theta \mapsto p(x,y \mid \theta)$ is bounded and continuous. Thus, for every fixed $(x,y)$, the function

$$ \pi \in \mathcal{P} \longmapsto p_{\pi}(x,y) := \int p(x,y \mid \theta)\, \mathrm{d}\pi(\theta) $$

is continuous, because of the definition of weak convergence. Therefore, for every predictive density $q$, the function $D_q$ from $\mathcal{P}$ to $\mathbb{R} \cup \{\infty\}$ defined by

$$ \begin{aligned} D_q(\pi) :=\ & \sum_{x,y} p_{\pi}(x,y) \log \frac{p_{\pi}(x,y)}{q(y;x)\, p_{\pi}(x)} \\ =\ & \sum_{x,y} p_{\pi}(x,y) \log p_{\pi}(x,y) - \sum_{x} p_{\pi}(x) \log p_{\pi}(x) - \sum_{(x,y):\, q(y;x)>0} p_{\pi}(x,y) \log q(y;x) \\ & - \sum_{(x,y):\, q(y;x)=0} p_{\pi}(x,y) \log q(y;x) \qquad (2) \end{aligned} $$

is lower semicontinuous, because the last term in (2) is lower semicontinuous and the other terms are continuous.
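The functional $D_q$ of (2) can be evaluated directly for a finitely supported prior; the case split on $q(y;x) = 0$ mirrors the last term of (2), which is what makes $D_q$ only lower semicontinuous. The Bernoulli-type submodel and the plug-in-style $q$ below are illustrative assumptions:

```python
import numpy as np

def joint(theta):
    # Illustrative submodel (assumption): x, y i.i.d. Bernoulli(theta).
    return np.array([[(1 - theta) ** 2, (1 - theta) * theta],
                     [theta * (1 - theta), theta ** 2]])

def D_q(thetas, weights, q):
    # D_q(pi) of (2) for a prior pi on a finite grid.  The early return
    # corresponds to the last term of (2): it is infinite as soon as
    # q(y; x) = 0 while p_pi(x, y) > 0.
    pxy = sum(w * joint(t) for t, w in zip(thetas, weights))
    px = pxy.sum(axis=1)
    d = 0.0
    for x in range(2):
        for y in range(2):
            if pxy[x, y] > 0:
                if q[y, x] == 0:
                    return float("inf")
                d += pxy[x, y] * np.log(pxy[x, y] / (px[x] * q[y, x]))
    return d

plugin = np.array([[1.0, 0.0], [0.0, 1.0]])   # q(y; x) = 1 iff y = x (assumption)
print(D_q([0.25, 0.75], [0.5, 0.5], plugin))  # inf: q vanishes where p_pi > 0
```

In contrast, $D_q(\pi) = 0$ whenever $q$ coincides with the conditional density $p_\pi(y \mid x)$ of the same prior.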

Lemma 1. Let $\mu$ be a probability measure on $\Theta$ and $\varepsilon \in [0,1]$. Then, $\mathcal{P}^{\varepsilon\mu} := \{\pi \in \mathcal{P} : \pi - \varepsilon\mu \text{ is a nonnegative measure}\}$ is a closed subset of $\mathcal{P}$.

Proof. Suppose that $\pi_{\infty}$ is the limit of a convergent sequence $\{\pi_k\}$ in $\mathcal{P}^{\varepsilon\mu}$. Since $\pi_k - \varepsilon\mu$ is a nonnegative measure,

$$ \int f(\theta)\, \mathrm{d}\pi_k(\theta) - \varepsilon \int f(\theta)\, \mathrm{d}\mu(\theta) \ge 0 $$

for every nonnegative bounded continuous function $f$ on $\Theta$. Thus,

$$ \int f(\theta)\, \mathrm{d}\pi_{\infty}(\theta) = \lim_{k\to\infty} \int f(\theta)\, \mathrm{d}\pi_k(\theta) \ge \varepsilon \int f(\theta)\, \mathrm{d}\mu(\theta). $$

Hence, $\pi_{\infty} - \varepsilon\mu$ is a nonnegative measure. Therefore, $\pi_{\infty} \in \mathcal{P}^{\varepsilon\mu}$, and $\mathcal{P}^{\varepsilon\mu}$ is a closed set in $\mathcal{P}$.

Lemma 2. Let $D$ be a continuous function from $\mathcal{P}$ to $\mathbb{R}$, and let $\mu$ be a probability measure on $\Theta$ such that $p_{\mu}(x) > 0$ for every $x \in \mathcal{X}$. Then, for every positive integer $n$, there is a probability measure $\pi_n$ in $\mathcal{P}^{\mu/n}$ such that $D(\pi_n) = \inf_{\pi \in \mathcal{P}^{\mu/n}} D(\pi)$. Furthermore, there exists a convergent subsequence $\{\pi_{n_k}\}$ of $\{\pi_n\}$ such that the equality $\lim_{k\to\infty} D(\pi_{n_k}) = \inf_{\pi \in \mathcal{P}} D(\pi)$ holds, where $\{n_k\}$ is an increasing sequence of positive integers.

Proof. Note that such a probability measure $\mu$ exists by Assumption 2. By Lemma 1, the sets $\mathcal{P}^{\mu/n}$ are compact because they are closed subsets of the compact set $\mathcal{P}$. Thus, there is a probability measure $\pi_n$ in $\mathcal{P}^{\mu/n}$ such that $D(\pi_n) = \inf_{\pi \in \mathcal{P}^{\mu/n}} D(\pi)$. There exists a convergent subsequence $\{\pi_{n_k}\}$ of $\{\pi_n\}$ because $\mathcal{P}$ is compact.

Since $\mathcal{P}$ is compact and $D$ is a continuous function of $\pi$, there exists $\hat{\pi} \in \mathcal{P}$ such that $D(\hat{\pi}) = \inf_{\pi \in \mathcal{P}} D(\pi)$. Thus, $D(\pi_n) \ge D(\hat{\pi})$, because $\mathcal{P}^{\mu/n} \subset \mathcal{P}$. For every $\varepsilon > 0$, there exists $\delta > 0$ such that $|D(\pi) - D(\hat{\pi})| < \varepsilon$ if $d(\pi, \hat{\pi}) < \delta$, where $d$ is the Prohorov metric on $\mathcal{P}$. We put

$$ \hat{\pi}_n = \frac{1}{n}\mu + \frac{n-1}{n}\hat{\pi} \qquad (n = 1, 2, 3, \ldots). $$

Then, $\hat{\pi}_n \in \mathcal{P}^{\mu/n}$ and $\hat{\pi}_n \to \hat{\pi}$. Thus, for every $\delta > 0$, there exists a positive integer $N$ such that $d(\hat{\pi}_n, \hat{\pi}) < \delta$ if $n \ge N$. If $n \ge N$, then $D(\pi_n) \le D(\hat{\pi}_n) < D(\hat{\pi}) + \varepsilon$. Since $\varepsilon$ is arbitrary, we have $\lim_{n\to\infty} D(\pi_n) = D(\hat{\pi})$. Therefore, $\lim_{k\to\infty} D(\pi_{n_k}) = \inf_{\pi \in \mathcal{P}} D(\pi)$.

The conditional probability $p_{\pi}(y \mid x)$ is not uniquely specified if $p_{\pi}(x) = 0$. To resolve the problem, we consider a sequence of priors $\{\pi_n\}$ that satisfies $p_{\pi_n}(x) > 0$ for every $n$ and $x$. In the following, $\lim_{n\to\infty} p_{\pi_n}(y \mid x)$ is defined to be the map from $\mathcal{X} \times \mathcal{Y}$ to $[0,1]$ given by the limit of the real number sequence $p_{\pi_n}(y \mid x)$. If the limits of the sequences of real numbers $p_{\pi_n}(y \mid x)$ exist for all $(x,y)$, we say the limit of Bayesian predictive densities exists. Obviously, if the limit exists, it is a predictive density because $p_{\pi_n}(y \mid x) \ge 0$ for every $(x,y)$ and $\sum_{y} p_{\pi_n}(y \mid x) = 1$ for every $x$.
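The role of such prior sequences can be seen in a toy case: a point mass at $\theta = 0$ gives $p_\pi(x = 1) = 0$, so $p_\pi(y \mid x = 1)$ is undefined, while mixing in a measure $\mu$ with positive marginals at weight $1/n$ makes every $p_{\pi_n}(x)$ positive. The grid measure and the Bernoulli-type submodel below are illustrative assumptions:

```python
import numpy as np

def joint(theta):
    # Illustrative submodel (assumption): x, y i.i.d. Bernoulli(theta).
    return np.array([[(1 - theta) ** 2, (1 - theta) * theta],
                     [theta * (1 - theta), theta ** 2]])

grid = np.linspace(0.05, 0.95, 19)   # support of a measure mu (an assumption)

def predictive_mix(n):
    # pi_n = (1/n) mu + (1 - 1/n) delta_0.  The point mass alone gives
    # p(x = 1) = 0, but the mixture has p_{pi_n}(x) > 0 for every x,
    # so p_{pi_n}(y | x) is well defined for every n.
    pxy = (1 - 1 / n) * joint(0.0)
    for t in grid:
        pxy += (1 / n) * joint(t) / len(grid)
    px = pxy.sum(axis=1, keepdims=True)
    return (pxy / px).T              # entry [y, x]

for n in (10, 1000, 100000):
    print(n, predictive_mix(n)[:, 1])  # p(y | x = 1): the same for every n
```

Here the limit exists trivially: $p_{\pi_n}(y \mid x = 1)$ equals $p_{\mu}(y \mid x = 1)$ for every $n$, because the point mass at $\theta = 0$ assigns probability zero to $x = 1$, so only the $\mu$ part contributes there.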

Theorem 1.

• For every predictive density $q(y;x)$, there exists a convergent prior sequence $\{\pi_n\}$ such that the limit $\lim_{n\to\infty} p_{\pi_n}(y \mid x)$ exists and $R(\theta, \lim_{n\to\infty} p_{\pi_n}) \le R(\theta, q)$ for every $\theta \in \Theta$.

• For a predictive density $q$, if there exists $\pi \in \mathcal{P}$ such that $D_q(\pi) = \inf_{\pi' \in \mathcal{P}} D_q(\pi')$ and $p_{\pi}(x) > 0$ for every $x \in \mathcal{X}$, then $R(\theta, p_{\pi}) \le R(\theta, q)$ for every $\theta \in \Theta$.

Proof. 1)  Let $N_q := \{(x,y) : q(y;x) = 0\}$ and $\Theta_q := \{\theta \in \Theta : R(\theta, q) < \infty\} = \{\theta \in \Theta : p(x,y \mid \theta) = 0 \text{ for every } (x,y) \in N_q\}$, and let $\mathcal{X}_q := \{x : p(x \mid \theta) > 0 \text{ for some } \theta \in \Theta_q\}$. Let $\mathcal{P}_q$ be the set of all probability measures on $\Theta_q$. Then, $\Theta_q$ and $\mathcal{P}_q$ are compact subsets of $\Theta$ and $\mathcal{P}$, respectively.

If $\Theta_q = \emptyset$, the assertion is obvious, because $R(\theta, q) = \infty$ for every $\theta \in \Theta$. We assume that $\Theta_q \neq \emptyset$ in the following. Let $\mu_q$ be a probability measure on $\Theta_q$ such that $p_{\mu_q}(x) > 0$ for every $x \in \mathcal{X}_q$; such a $\mu_q$ exists because, for every $x \in \mathcal{X}_q$, there exists $\theta \in \Theta_q$ with $p(x \mid \theta) > 0$.

Then, because $D_q(\pi)$ defined by (2) is continuous as a function of $\pi \in \mathcal{P}_q$, there exists, for every positive integer $n$, a prior $\pi_n \in \mathcal{P}_q^{\mu_q/n} := \{\pi \in \mathcal{P}_q : \pi - \mu_q/n \text{ is a nonnegative measure}\}$ such that $D_q(\pi_n) = \inf_{\pi \in \mathcal{P}_q^{\mu_q/n}} D_q(\pi)$. From Lemma 2, there exists a convergent subsequence $\{\pi'_m\}$ of $\{\pi_n\}$ such that $\lim_{m\to\infty} D_q(\pi'_m) = \inf_{\pi \in \mathcal{P}_q} D_q(\pi)$, where $\pi'_{\infty} := \lim_{m\to\infty} \pi'_m$.

Let $n_m$ be the integer satisfying $\pi'_m = \pi_{n_m}$. We can take a subsequence such that $n_m/n_{m+1} \le c$ for some positive constant $c < 1$. Since

$$ \frac{n_m}{n_{m+1}}\pi'_m + \Big(1 - \frac{n_m}{n_{m+1}}\Big)\delta_\theta = \frac{n_m}{n_{m+1}}\pi_{n_m} + \Big(1 - \frac{n_m}{n_{m+1}}\Big)\delta_\theta \in \mathcal{P}_q^{\mu_q/n_{m+1}} $$

for every $\theta \in \Theta_q$, where $\delta_\theta$ is the probability measure on $\Theta_q$ satisfying $\delta_\theta(\{\theta\}) = 1$, we have

$$ \tilde{\pi}_{m,\theta,u} := u\Big\{\frac{n_m}{n_{m+1}}\pi'_m + \Big(1 - \frac{n_m}{n_{m+1}}\Big)\delta_\theta\Big\} + (1-u)\,\pi'_{m+1} \in \mathcal{P}_q^{\mu_q/n_{m+1}} $$

for every $\theta \in \Theta_q$ and $u \in [0,1]$. Thus,

$$ \begin{aligned} \left.\frac{\partial}{\partial u} D_q(\tilde{\pi}_{m,\theta,u})\right|_{u=0} &= \left.\frac{\partial}{\partial u} \sum_{(x,y)\notin N_q} p_{\tilde{\pi}_{m,\theta,u}}(x,y)\log\frac{p_{\tilde{\pi}_{m,\theta,u}}(x,y)}{q(y;x)\,p_{\tilde{\pi}_{m,\theta,u}}(x)}\right|_{u=0} \\ &= \sum_{(x,y)\notin N_q}\left\{\left.\frac{\partial}{\partial u} p_{\tilde{\pi}_{m,\theta,u}}(x,y)\right|_{u=0}\right\}\log\frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} \\ &= \frac{n_m}{n_{m+1}}\sum_{(x,y)\notin N_q} p_{\pi'_m}(x,y)\log\frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} - \sum_{(x,y)\notin N_q} p_{\pi'_{m+1}}(x,y)\log\frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} \\ &\quad + \frac{n_{m+1}-n_m}{n_{m+1}}\sum_{(x,y)\notin N_q} p(x,y\mid\theta)\log\frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} \ \ge\ 0. \end{aligned} $$

Hence,

$$ \begin{aligned} \sum_{(x,y)\notin N_q} p(x,y\mid\theta)\log\frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} \ \ge\ & \frac{n_{m+1}}{n_{m+1}-n_m}\sum_{(x,y)\notin N_q} p_{\pi'_{m+1}}(x,y)\log\frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} - \frac{n_m}{n_{m+1}-n_m}\sum_{(x,y)\notin N_q} p_{\pi'_m}(x,y)\log\frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} \\ =\ & \frac{n_{m+1}}{n_{m+1}-n_m}\sum_{(x,y)\notin N_q} p_{\pi'_{m+1}}(x,y)\log\frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} \\ & + \frac{n_m}{n_{m+1}-n_m}\Big\{-\sum_{(x,y)\notin N_q\cup N_{\pi'_\infty}} p_{\pi'_m}(x,y)\log\frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} - \sum_{(x,y)\in N_{\pi'_\infty}\setminus N_q} p_{\pi'_m}(x,y)\log p_{\pi'_{m+1}}(y\mid x) + \sum_{(x,y)\in N_{\pi'_\infty}\setminus N_q} p_{\pi'_m}(x,y)\log q(y;x)\Big\} \\ \ge\ & \frac{n_{m+1}}{n_{m+1}-n_m}\sum_{(x,y)\notin N_q} p_{\pi'_{m+1}}(x,y)\log\frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} \\ & + \frac{n_m}{n_{m+1}-n_m}\Big\{-\sum_{(x,y)\notin N_q\cup N_{\pi'_\infty}} p_{\pi'_m}(x,y)\log\frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} + \sum_{(x,y)\in N_{\pi'_\infty}\setminus N_q} p_{\pi'_m}(x,y)\log q(y;x)\Big\}, \qquad (3) \end{aligned} $$

where $N_{\pi'_\infty} := \{(x,y) : p_{\pi'_\infty}(x,y) = 0\}$. Here, we have

$$ \lim_{m\to\infty}\sum_{(x,y)\notin N_q\cup N_{\pi'_\infty}} p_{\pi'_m}(x,y)\log\frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} = \sum_{(x,y)\notin N_q\cup N_{\pi'_\infty}} p_{\pi'_\infty}(x,y)\log\frac{p_{\pi'_\infty}(x,y)}{q(y;x)\,p_{\pi'_\infty}(x)}, \qquad (4) $$

because $p_{\pi'_\infty}(x,y) > 0$ for every $(x,y) \notin N_q \cup N_{\pi'_\infty}$, and

$$ \lim_{m\to\infty}\sum_{(x,y)\in N_{\pi'_\infty}\setminus N_q} p_{\pi'_m}(x,y)\log q(y;x) = 0 = -\sum_{(x,y)\in N_{\pi'_\infty}\setminus N_q} p_{\pi'_\infty}(x,y)\log\frac{p_{\pi'_\infty}(x,y)}{q(y;x)\,p_{\pi'_\infty}(x)}. \qquad (5) $$

Therefore, from (3), (4), (5), and $n_m/n_{m+1} \le c < 1$, for every $\theta \in \Theta_q$,

$$ \liminf_{m\to\infty} \sum_{(x,y)\notin N_q} p(x,y\mid\theta)\log\frac{p_{\pi'_{m+1}}(x,y)}{q(y;x)\,p_{\pi'_{m+1}}(x)} \ \ge\ 0. \qquad (6) $$

By taking an appropriate subsequence $\{\pi''_k\}$ of $\{\pi'_m\}$, we can make the sequences of real numbers $p_{\pi''_k}(y \mid x)$ converge for all $(x,y)$, because $0 \le p_{\pi''_k}(y \mid x) \le 1$ and $p_{\pi''_k}(x) > 0$ for every $x \in \mathcal{X}_q$.

Then, from (6), if $\theta \in \Theta_q$,

$$ \begin{aligned} R\Big(\theta,\ \lim_{k\to\infty} p_{\pi''_k}(y\mid x)\Big) &= \sum_{x,y} p(x,y\mid\theta)\log\frac{p(y\mid x,\theta)}{\lim_{k\to\infty} p_{\pi''_k}(y\mid x)} = \sum_{(x,y)\notin N_q} p(x,y\mid\theta)\log\frac{p(y\mid x,\theta)}{\lim_{k\to\infty} p_{\pi''_k}(y\mid x)} \\ &\le \sum_{(x,y)\notin N_q} p(x,y\mid\theta)\log\frac{p(y\mid x,\theta)}{q(y;x)} = \sum_{x,y} p(x,y\mid\theta)\log\frac{p(y\mid x,\theta)}{q(y;x)} = R(\theta, q(y;x)) < \infty. \end{aligned} $$

Note that the risk does not depend on the choice of $\lim_{k\to\infty} p_{\pi''_k}(y \mid x)$ for $(x,y) \in N_q$, although $\lim_{k\to\infty} p_{\pi''_k}(y \mid x)$ is not uniquely determined for such $(x,y)$.

If $\theta \in \Theta_q$, $p(x \mid \theta) = 0$ for every $x \notin \mathcal{X}_q$ because of the definition of $\mathcal{X}_q$. For $x \notin \mathcal{X}_q$, $p(x \mid \theta) > 0$ only when $\theta \notin \Theta_q$. Thus, if $x \notin \mathcal{X}_q$ is observed, then $R(\theta, q) = \infty$ because $\theta \notin \Theta_q$.

Hence, the risk of the predictive density defined by

$$ \begin{cases} \lim_{k\to\infty} p_{\pi''_k}(y\mid x), & x \in \mathcal{X}_q, \\ r(y;x), & x \notin \mathcal{X}_q, \end{cases} $$

where $r(y;x)$ is an arbitrary predictive density, is not greater than that of $q(y;x)$ for every $\theta \in \Theta$.

Therefore, by taking a sequence $\{\varepsilon_k\}$ that converges rapidly enough to $0$, we can construct a predictive density

$$ \lim_{k\to\infty} p_{\varepsilon_k\bar{\mu} + (1-\varepsilon_k)\pi''_k}(y\mid x) = \begin{cases} \lim_{k\to\infty} p_{\pi''_k}(y\mid x), & x \in \mathcal{X}_q, \\ p_{\bar{\mu}}(y\mid x), & x \notin \mathcal{X}_q, \end{cases} \qquad (7) $$

as a limit of Bayesian predictive densities based on the priors $\varepsilon_k\bar{\mu} + (1-\varepsilon_k)\pi''_k$, where $\bar{\mu}$ is a measure on $\Theta$ such that $p_{\bar{\mu}}(x) > 0$ for every $x \in \mathcal{X}$.

Hence, the risk of the predictive density (7) is not greater than that of $q(y;x)$ for every $\theta \in \Theta$.

2)  In this case, the proof becomes much simpler. We assume that $\Theta_q \neq \emptyset$ because the assertion is obvious if $\Theta_q = \emptyset$. Then, $p_{\pi}(y \mid x)$ is uniquely determined because $p_{\pi}(x) > 0$ for every $x \in \mathcal{X}$. Thus, we can set $\pi'_m = \pi$ for every $m$ in the proof of 1). Furthermore, we can set $\varepsilon_k = 0$ because $p_{\pi}(x) > 0$ for every $x \in \mathcal{X}$. Therefore, the desired result can be proved without considering limits of Bayesian predictive densities.

We give two simple examples to clarify the meaning of Theorem 1 and its proof.

Example 1. Suppose that $\mathcal{X} = \{0,1\}$, $\mathcal{Y} = \{0,1\}$, $\Theta = [0,1]$, and $p(x,y \mid \theta) = \theta^{x+y}(1-\theta)^{2-x-y}$. Let $q(y;x) := x^{y}(1-x)^{1-y}$, which is the plug-in predictive density with the maximum likelihood estimate $\hat{\theta} = x$. Then, $N_q = \{(0,1),(1,0)\}$, $\Theta_q = \{0,1\}$, and $\mathcal{X}_q = \mathcal{X}$. The prior $\pi(w)$ defined by $\pi(w) := w\delta_0 + (1-w)\delta_1$, $0 \le w \le 1$, satisfies

$$ D_q(\pi(w)) = \inf_{\pi \in \mathcal{P}_q} D_q(\pi) = 0. $$

We set , which satisfies for . Then, we can set because and . Then, . Thus, and .

The prior $\pi(w)$ with $w \in \{0,1\}$ does not specify the conditional density $p_{\pi(w)}(y \mid x)$ for every $x$ because $p_{\pi(0)}(0) = 0$ and $p_{\pi(1)}(1) = 0$. We set $\bar{\mu}$ to be a probability measure on $\Theta$ with $p_{\bar{\mu}}(x) > 0$ for every $x \in \mathcal{X}$ and

$$ \pi''_k = \frac{1}{k}\bar{\mu} + \Big(1 - \frac{1}{k}\Big)\pi(w). $$

Then, $p_{\pi''_k}(x) > 0$ for every $x$ and $\pi''_k \to \pi(w)$. The risk function of the predictive density $\lim_{k\to\infty} p_{\pi''_k}(y \mid x)$, which is a limit of the Bayesian predictive densities, is given by

$$ R\Big(\theta, \lim_{k\to\infty} p_{\pi''_k}(y\mid x)\Big) = \begin{cases} 0, & \theta = 0 \in \Theta_q, \\ \infty, & \theta \in (0,1) = \Theta \setminus \Theta_q, \\ 0, & \theta = 1 \in \Theta_q, \end{cases} $$

and coincides with $R(\theta, q)$.

Example 2. Suppose that , , , , , , , and , where .

Consider a predictive density defined by , , , and . Then, , , , and .

Then, satisfies because except for the case . Since , is not uniquely determined. Thus, we consider a limit of Bayesian predictive densities.

Put . It can be easily verified that satisfies . Then, , ,