# Asymptotically Minimax Prediction in Infinite Sequence Models

Keisuke Yano (yano@mist.i.u-tokyo.ac.jp)
Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan

Fumiyasu Komaki (komaki@mist.i.u-tokyo.ac.jp)
Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
RIKEN Brain Science Institute, 2-1 Hirosawa, Wako City, Saitama 351-0198, Japan

###### Abstract

We study asymptotically minimax predictive distributions in infinite sequence models. First, we discuss the connection between prediction in an infinite sequence model and prediction in a function model. Second, we construct an asymptotically minimax predictive distribution for the setting in which the parameter space is a known ellipsoid. We show that the Bayesian predictive distribution based on the Gaussian prior distribution is asymptotically minimax in the ellipsoid. Third, we construct an asymptotically minimax predictive distribution for any Sobolev ellipsoid. We show that the Bayesian predictive distribution based on the product of Stein’s priors is asymptotically minimax for any Sobolev ellipsoid. Finally, we present an efficient sampling method from the proposed Bayesian predictive distribution.

MSC subject classifications: 62C10; 62C20; 62G20

Keywords: Adaptivity, Kullback–Leibler divergence, Nonparametric statistics, Predictive distribution, Stein’s prior

## 1 Introduction

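We consider prediction in an infinite sequence model. The current observation is a random sequence $X=(X_1,X_2,\ldots)$ given by

$$X_i=\theta_i+\varepsilon W_i \quad \text{for } i\in\mathbb{N}, \tag{1}$$

where $\theta=(\theta_1,\theta_2,\ldots)$ is an unknown sequence in $l_2$ and $W=(W_1,W_2,\ldots)$ is a random sequence distributed according to $\otimes_{i=1}^{\infty}N(0,1)$ on $(\mathbb{R}^{\infty},\mathcal{R}^{\infty})$. Here $\mathcal{R}^{\infty}$ is the product $\sigma$-field of the Borel $\sigma$-field on the Euclidean space $\mathbb{R}$. Based on the current observation $X$, we estimate the distribution of a future observation $Y=(Y_1,Y_2,\ldots)$ given by

$$Y_i=\theta_i+\tilde{\varepsilon}\,\widetilde{W}_i \quad \text{for } i\in\mathbb{N}, \tag{2}$$

where $\widetilde{W}=(\widetilde{W}_1,\widetilde{W}_2,\ldots)$ is distributed according to $\otimes_{i=1}^{\infty}N(0,1)$. We denote the true distribution of $X$ with $\theta$ by $P_\theta$ and the true distribution of $Y$ with $\theta$ by $Q_\theta$. For simplicity, we assume that $W$ and $\widetilde{W}$ are independent.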

Prediction in an infinite sequence model is shown to be equivalent to the following prediction in a function model. Suppose that we observe a random function $X(\cdot)$ in $L_2[0,1]$ given by

$$X(\cdot)=F(\cdot)+\varepsilon W(\cdot) \quad \text{in } L_2[0,1], \tag{3}$$

where $L_2[0,1]$ is the $L_2$-space on $[0,1]$ with the Lebesgue measure, $F(\cdot)$ is an unknown absolutely continuous function whose derivative is in $L_2[0,1]$, $\varepsilon$ is a known constant, and $W(\cdot)$ follows the standard Wiener measure on $L_2[0,1]$. Based on the current observation $X(\cdot)$, we estimate the distribution of a random function $Y(\cdot)$ given by

$$Y(\cdot)=F(\cdot)+\tilde{\varepsilon}\,\widetilde{W}(\cdot) \quad \text{in } L_2[0,1], \tag{4}$$

where $\tilde{\varepsilon}$ is a known constant, and $\widetilde{W}(\cdot)$ follows the standard Wiener measure on $L_2[0,1]$. The details are provided in Section 2. Xu and Liang established the connection between prediction of a function on equispaced grids and prediction in a high-dimensional sequence model, using asymptotics in which the dimension of the parameter grows to infinity with the grid size. Our study is motivated by Xu and Liang and generalizes their result to settings in which the parameter is infinite-dimensional.

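Using the above equivalence, we discuss the performance of a predictive distribution $\hat{Q}$ of $Y$ based on $X$ in an infinite sequence model. Let $\mathcal{A}$ be the whole set of probability measures on $(\mathbb{R}^{\infty},\mathcal{R}^{\infty})$ and let $\mathcal{D}$ be the decision space $\{\hat{Q}:\hat{Q}(\cdot;x)\in\mathcal{A} \text{ for } x\in\mathbb{R}^{\infty}\}$. We use the Kullback–Leibler loss as a loss function: for all $Q\in\mathcal{A}$ and all $\theta\in l_2$, if $Q_\theta$ is absolutely continuous with respect to $Q$, then

$$l(\theta,Q):=\int\log\frac{dQ_\theta}{dQ}(y)\,dQ_\theta(y),$$

and otherwise $l(\theta,Q)=\infty$. The risk of a predictive distribution $\hat{Q}\in\mathcal{D}$ in the case that the true distributions of $X$ and $Y$ are $P_\theta$ and $Q_\theta$, respectively, is denoted by

$$R(\theta,\hat{Q}):=\int l(\theta,\hat{Q}(\cdot;x))\,dP_\theta(x).$$

We construct an asymptotically minimax predictive distribution $\hat{Q}^{*}\in\mathcal{D}$ that satisfies

$$\lim_{\varepsilon\to0}\left[\sup_{\theta\in\Theta(a,B)}R(\theta,\hat{Q}^{*})\Big/\inf_{\hat{Q}\in\mathcal{D}}\sup_{\theta\in\Theta(a,B)}R(\theta,\hat{Q})\right]=1,$$

where $\Theta(a,B):=\{\theta\in l_2:\sum_{i=1}^{\infty}a_i^2\theta_i^2\le B\}$ with a known non-zero and non-decreasing divergent sequence $a=(a_1,a_2,\ldots)$ and with a known constant $B$. Note that for any $\varepsilon>0$, the minimax risk is bounded above by $B/(2a_1^2\tilde{\varepsilon}^2)$, because the plug-in predictive distribution $Q_0$ has risk $\|\theta\|^2/(2\tilde{\varepsilon}^2)$ and $a_1^2\|\theta\|^2\le\sum_{i=1}^{\infty}a_i^2\theta_i^2\le B$ on $\Theta(a,B)$. Further, note that by the above equivalence between the infinite sequence model and the function model, the parameter restriction $\theta\in\Theta(a,B)$ in the infinite sequence model corresponds to the restriction that the corresponding parameter in the function model is smooth; $B$ represents the volume of the parameter space, and the growth rate of $a$ represents the smoothness of the functions.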

The constructed predictive distribution is the Bayesian predictive distribution based on a Gaussian prior distribution. For a prior distribution $\Pi$ of $\theta$, the Bayesian predictive distribution $Q_\Pi$ based on $\Pi$ is obtained by averaging $Q_\theta$ with respect to the posterior distribution based on $\Pi$. Our construction is a generalization of the result in Xu and Liang to infinite-dimensional settings. The details are provided in Section 3.

Further, we discuss adaptivity to the sequence $a$ and the constant $B$. In applications, since we do not know the true values of $a$ and $B$, it is desirable to construct a predictive distribution that does not use $a$ and $B$ and that is asymptotically minimax in any ellipsoid in the class. Such a predictive distribution is called an asymptotically minimax adaptive predictive distribution in the class. In the present paper, we focus on an asymptotically minimax adaptive predictive distribution in the simplified Sobolev class $\{\Theta_{\mathrm{Sobolev}}(\alpha,B):\alpha>0,\,B>0\}$, where $\Theta_{\mathrm{Sobolev}}(\alpha,B):=\{\theta\in l_2:\sum_{i=1}^{\infty}i^{2\alpha}\theta_i^2\le B\}$.

Our construction of the asymptotically minimax adaptive predictive distribution is based on Stein’s prior and the division of the parameter into blocks. The proof of the adaptivity relies on a new oracle inequality related to the Bayesian predictive distribution based on Stein’s prior; see Subsection 4.2. Stein’s prior on $\mathbb{R}^{d}$ is an improper prior whose density is $\pi(\theta)=\|\theta\|^{2-d}$. It is known that the Bayesian predictive distribution based on that prior has a smaller Kullback–Leibler risk than that based on the uniform prior in finite-dimensional Gaussian settings; see Komaki and George, Liang and Xu. The division of the parameter into blocks is widely used for the construction of asymptotically minimax adaptive estimators; see Efromovich and Pinsker, Cai, Low and Zhao, and Cavalier and Tsybakov. The details are provided in Section 4.

The remainder of the paper is organized as follows. In Section 5, we provide an efficient sampling method for the proposed asymptotically minimax adaptive predictive distribution and provide numerical experiments with a fixed $\varepsilon$. In Section 6, we conclude the paper.

## 2 Equivalence between predictions in infinite sequence models and predictions in function models

In this section, we provide an equivalence between prediction in a function model and prediction in an infinite sequence model. The proof consists of two steps. First, we provide a connection between predictions in a function model and predictions in a submodel of an infinite sequence model. Second, we extend predictions in the submodel to predictions in the infinite sequence model.

The detailed description of prediction in a function model is as follows. Let , where denotes the derivative of . Let be the inner product of . Let be the whole set of probability distributions on , where is the Borel -field of . denotes the Kullback–Leibler loss of in the setting that the true parameter function is .

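Let $C$ be the covariance operator of $W(\cdot)$: for any $x(\cdot)\in L_2[0,1]$, $C(x(\cdot))(\cdot)=\int_0^1\min(\cdot,t)\,x(t)\,dt$. By Mercer’s theorem, there exist a non-negative monotone decreasing sequence $\{\lambda_i\}_{i=1}^{\infty}$ and an orthonormal basis $\{e_i(\cdot)\}_{i=1}^{\infty}$ in $L_2[0,1]$ such that

$$C(x(\cdot))(\cdot)=\sum_{i=1}^{\infty}\lambda_i\langle x(\cdot),e_i(\cdot)\rangle_{L_2}\,e_i(\cdot) \quad \text{in } L_2[0,1].$$

Explicitly, $\lambda_i$ is $\{(i-1/2)\pi\}^{-2}$ and $e_i(t)$ is $\sqrt{2}\sin((i-1/2)\pi t)$ for $i\in\mathbb{N}$.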
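
As a numerical sanity check of this eigendecomposition, the following sketch applies the covariance kernel $\min(s,t)$ of the standard Wiener process to a candidate eigenfunction with a midpoint rule and compares the result with $\lambda_i e_i(s)$. It assumes the standard Karhunen–Loève eigenpairs $\lambda_i=\{(i-1/2)\pi\}^{-2}$ and $e_i(t)=\sqrt{2}\sin((i-1/2)\pi t)$; the function names, the grid size, and the evaluation points are illustrative choices and do not come from the paper.

```python
import math


def eigenvalue(i):
    # Assumed eigenvalue of the Wiener covariance operator: 1 / ((i - 1/2) pi)^2
    return 1.0 / (((i - 0.5) * math.pi) ** 2)


def eigenfunction(i, t):
    # Assumed eigenfunction: sqrt(2) * sin((i - 1/2) pi t)
    return math.sqrt(2.0) * math.sin((i - 0.5) * math.pi * t)


def apply_covariance(i, s, n_grid=20000):
    # Midpoint-rule approximation of the integral of min(s, t) * e_i(t) over [0, 1]
    h = 1.0 / n_grid
    total = 0.0
    for k in range(n_grid):
        t = (k + 0.5) * h
        total += min(s, t) * eigenfunction(i, t)
    return total * h


def max_eigen_residual(i, points=(0.1, 0.37, 0.8, 1.0)):
    # Largest deviation |C e_i(s) - lambda_i e_i(s)| over a few test points s
    return max(
        abs(apply_covariance(i, s) - eigenvalue(i) * eigenfunction(i, s))
        for s in points
    )
```

Under these assumptions, `max_eigen_residual(i)` should be close to zero for every index `i`, up to the quadrature error of the midpoint rule.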

The detailed description of prediction in the sub-model of an infinite sequence model is as follows. Let be . Note that is a measurable set with respect to , because is the pointwise -limit of and we use Theorem 4.2.2. in Dudley . Let be the whole set of probability distributions on , where is the relative -field of .

The following theorem states that the Kullback–Leibler loss in the function model is equivalent to that in the submodel of the infinite sequence model.

###### Theorem 2.1.

For every and every , there exist and such that

$$l_F(F,Q)=l(\theta,\widetilde{Q}).$$

Conversely, for every and every , there exist and such that

$$l(\theta,\widetilde{Q})=l_F(F,Q).$$

###### Proof.

We construct pairs of a measurable one-to-one map and a measurable one-to-one map .

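Let $\Phi$ be defined by

$$\Phi(x(\cdot)):=\begin{pmatrix}\langle x(\cdot),\lambda_1^{-1/2}e_1(\cdot)\rangle_{L_2}\\ \langle x(\cdot),\lambda_2^{-1/2}e_2(\cdot)\rangle_{L_2}\\ \vdots\end{pmatrix}.$$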

is well-defined as a map from to because for and in such that , we have , and because for , we have .

We show that is one-to-one, onto, and measurable. is one-to-one because if , then we have for all . is onto because if , satisfies that . is measurable because is continuous with respect to the norm of and , and because is equal to the Borel -field with respect to . is continuous, because we have

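$$\rho(\Phi(x(\cdot)),\Phi(y(\cdot)))=\sum_{i=1}^{\infty}\left(\lambda_i^{-1/2}/2^{i}\right)\left|\langle x(\cdot),e_i(\cdot)\rangle_{L_2}-\langle y(\cdot),e_i(\cdot)\rangle_{L_2}\right|\wedge 1.$$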

Further, the restriction of to is a one-to-one and onto map from to .

Let be defined by . is the inverse of . Thus, is one-to-one, onto, and measurable.

Since the Kullback–Leibler divergence is unchanged under a measurable one-to-one mapping, the proof is completed. ∎

###### Remark 2.2.

Mandelbaum constructed the connection between estimation in an infinite sequence model and estimation in a function model. Our connection extends it to prediction. In fact, the map $\Phi$ is used in Mandelbaum.

The following theorem justifies focusing on prediction in instead of prediction in .

###### Theorem 2.3.

For every and , there exists such that

$$l(\theta,\widetilde{Q})\le l(\theta,Q).$$

In particular, for any subset of ,

 infˆQ∈Dsupθ∈ΘR(θ,ˆQ)=infˆQ∈DDsupθ∈ΘR(θ,ˆQ),

where .

###### Proof.

Note that by the Karhunen–Loève theorem. For such that , and then for any , . For such that ,

 l(θ,Q)=l(θ,˜Q)−logQ(SD)≥l(θ,˜Q),

where is the restriction of to . ∎

## 3 Asymptotically minimax predictive distribution

In this section, we construct an asymptotically minimax predictive distribution for the setting in which the parameter space is an ellipsoid $\Theta(a,B)$ with a known sequence $a$ and with a known constant $B$. Further, we provide the asymptotically minimax predictive distributions in two well-known ellipsoids: a Sobolev ellipsoid and an exponential ellipsoid.

### 3.1 Principal theorem of Section 3

We construct an asymptotically minimax predictive distribution in Theorem 3.1.

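We introduce the notation used in the principal theorem. For an infinite sequence $\tau=(\tau_1,\tau_2,\ldots)$, let $G_\tau$ be the Gaussian prior $\otimes_{i=1}^{\infty}N(0,\tau_i^2)$. Then, the posterior distribution based on $G_\tau$ is

$$G_\tau(\cdot\mid X=x)=\bigotimes_{i=1}^{\infty}N\!\left(\frac{1/\varepsilon^2}{1/\varepsilon^2+1/\tau_i^2}\,x_i,\ \frac{1}{1/\varepsilon^2+1/\tau_i^2}\right)\quad P_\theta\text{-a.s.} \tag{5}$$

The Bayesian predictive distribution $Q_{G_\tau}$ based on $G_\tau$ is

$$Q_{G_\tau}(\cdot\mid X=x)=\bigotimes_{i=1}^{\infty}N\!\left(\frac{1/\varepsilon^2}{1/\varepsilon^2+1/\tau_i^2}\,x_i,\ \frac{1}{1/\varepsilon^2+1/\tau_i^2}+\tilde{\varepsilon}^2\right)\quad P_\theta\text{-a.s.} \tag{6}$$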
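
To make (5) and (6) concrete, the following sketch computes the posterior and predictive mean and variance of a single coordinate under the Gaussian prior $N(0,\tau_i^2)$. It is a minimal illustration assuming $\tau_i>0$; the function names are hypothetical and not taken from the paper.

```python
def posterior_coordinate(x_i, tau_i, eps):
    # Posterior of theta_i given X_i = x_i under the prior N(0, tau_i^2), as in (5).
    # Assumes tau_i > 0 so that the prior precision 1 / tau_i^2 is finite.
    precision = 1.0 / eps ** 2 + 1.0 / tau_i ** 2
    mean = (1.0 / eps ** 2) / precision * x_i
    variance = 1.0 / precision
    return mean, variance


def predictive_coordinate(x_i, tau_i, eps, eps_tilde):
    # Predictive distribution of Y_i given X_i = x_i, as in (6):
    # same mean as the posterior, variance inflated by the future noise eps_tilde^2.
    mean, variance = posterior_coordinate(x_i, tau_i, eps)
    return mean, variance + eps_tilde ** 2
```

For example, with $x_i=2$, $\tau_i=1$, $\varepsilon=1$, and $\tilde{\varepsilon}=0.5$, the posterior is $N(1, 0.5)$ and the predictive distribution is $N(1, 0.75)$.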

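For the derivations of (5) and (6), see Theorem 3.2 in Zhao. Let $v^2_{\varepsilon,\tilde{\varepsilon}}$ and $v^2_{\varepsilon}$ be defined by

$$v^2_{\varepsilon,\tilde{\varepsilon}}:=\frac{1}{1/\varepsilon^2+1/\tilde{\varepsilon}^2} \quad\text{and}\quad v^2_{\varepsilon}:=\varepsilon^2, \tag{7}$$

respectively. Let $\tau^{*}(\varepsilon,\tilde{\varepsilon})$ be the infinite sequence whose $i$-th coordinate for $i\in\mathbb{N}$ is defined by

$$(\tau^{*}_i(\varepsilon,\tilde{\varepsilon}))^2=\frac{1}{2}\left[(v^2_{\varepsilon}-v^2_{\varepsilon,\tilde{\varepsilon}})\sqrt{1+\frac{4}{2\lambda(\varepsilon,\tilde{\varepsilon})\,a_i^2\,(v^2_{\varepsilon}-v^2_{\varepsilon,\tilde{\varepsilon}})}}-(v^2_{\varepsilon}+v^2_{\varepsilon,\tilde{\varepsilon}})\right]_{+}, \tag{8}$$

where $[t]_{+}:=\max(t,0)$, and $\lambda(\varepsilon,\tilde{\varepsilon})$ is determined by

$$\sum_{i=1}^{\infty}a_i^2(\tau^{*}_i(\varepsilon,\tilde{\varepsilon}))^2=B.$$

Let $T(\varepsilon,\tilde{\varepsilon})$ be the number defined by

$$T(\varepsilon,\tilde{\varepsilon}):=\sup\{i:\tau^{*}_i(\varepsilon,\tilde{\varepsilon}) \text{ is non-zero}\}=\sup\left\{i:\frac{1}{\lambda(\varepsilon,\tilde{\varepsilon})\,a_i^2}>2\tilde{\varepsilon}^2\right\}. \tag{9}$$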
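
The quantities in (8) and (9) can be computed numerically. The following sketch is one possible approach rather than the authors' procedure: it truncates the sequence $a$ to finitely many coordinates, solves the constraint $\sum_i a_i^2(\tau_i^*)^2=B$ for $\lambda$ by bisection on a logarithmic scale, and then reads off $T$. The function names, the bisection bracket, and the truncation are assumptions made for illustration; the truncation must be long enough that the trailing $\tau_i^*$ vanish.

```python
import math


def tau_star_sq(a_i, lam, eps, eps_tilde):
    # Squared prior scale of one coordinate, following (8)
    v_eps = eps ** 2
    v_joint = 1.0 / (1.0 / eps ** 2 + 1.0 / eps_tilde ** 2)
    diff = v_eps - v_joint
    root = math.sqrt(1.0 + 4.0 / (2.0 * lam * a_i ** 2 * diff))
    return max(0.5 * (diff * root - (v_eps + v_joint)), 0.0)


def minimax_gaussian_prior(a, B, eps, eps_tilde, iters=200):
    # a: truncated non-decreasing sequence (a_1, ..., a_n), assumed long enough
    def budget(lam):
        return sum(a_i ** 2 * tau_star_sq(a_i, lam, eps, eps_tilde) for a_i in a)

    lo, hi = 1e-12, 1e12  # assumed bracket: budget(lo) > B >= budget(hi)
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisection on log(lambda)
        if budget(mid) > B:
            lo = mid  # budget is decreasing in lambda, so move right
        else:
            hi = mid
    lam = math.sqrt(lo * hi)
    tau_sq = [tau_star_sq(a_i, lam, eps, eps_tilde) for a_i in a]
    # Number of indices satisfying the criterion in (9); equals T when a is non-decreasing
    T = sum(1 for a_i in a if 1.0 / (lam * a_i ** 2) > 2.0 * eps_tilde ** 2)
    return lam, tau_sq, T
```

For instance, `minimax_gaussian_prior([float(i) for i in range(1, 201)], 1.0, 0.05, 0.05)` treats the Sobolev-type sequence $a_i=i$ with $B=1$ and $\varepsilon=\tilde{\varepsilon}=0.05$; only the first $T$ entries of the returned `tau_sq` are non-zero.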

The following is the principal theorem of this section.

###### Theorem 3.1.

Let be . Assume that . If as and , then

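$$\lim_{\varepsilon\to0}\left[\left\{\inf_{\hat{Q}\in\mathcal{D}}\sup_{\theta\in\Theta(a,B)}R(\theta,\hat{Q})\right\}\Big/\sum_{i=1}^{T(\varepsilon,\tilde{\varepsilon})}\frac{1}{2}\log\left(\frac{1+(\tau^{*}_i(\varepsilon,\tilde{\varepsilon}))^2/v^2_{\varepsilon,\tilde{\varepsilon}}}{1+(\tau^{*}_i(\varepsilon,\tilde{\varepsilon}))^2/v^2_{\varepsilon}}\right)\right]=1.$$

Further, the Bayesian predictive distribution based on $G_{\tau=\tau^{*}(\varepsilon,\tilde{\varepsilon})}$ is asymptotically minimax:

$$\sup_{\theta\in\Theta(a,B)}R(\theta,Q_{G_{\tau=\tau^{*}(\varepsilon,\tilde{\varepsilon})}})=(1+o(1))\inf_{\hat{Q}\in\mathcal{D}}\sup_{\theta\in\Theta(a,B)}R(\theta,\hat{Q})$$

as $\varepsilon\to0$.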

The proof is provided in the next subsection.

### 3.2 Proof of the principal theorem of Section 3

The proof of Theorem 3.1 requires five lemmas. Because the parameter is infinite-dimensional, we need Lemmas 3.2 and 3.5 in addition to Theorem 4.2 in Xu and Liang .

The first lemma provides the explicit form of the Kullback–Leibler risk of the Bayesian predictive distribution $Q_{G_\tau}$. The proof is provided in Appendix A.

###### Lemma 3.2.

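If $\theta\in l_2$ and $\tau\in l_2$, then $Q_\theta$ and $Q_{G_\tau}(\cdot\mid X=x)$ are mutually absolutely continuous given $X=x$ $P_\theta$-a.s., and the Kullback–Leibler risk of the Bayesian predictive distribution $Q_{G_\tau}$ is given by

$$R(\theta,Q_{G_\tau})=\sum_{i=1}^{\infty}\left\{\frac{1}{2}\log\left(\frac{1+\tau_i^2/v^2_{\varepsilon,\tilde{\varepsilon}}}{1+\tau_i^2/v^2_{\varepsilon}}\right)+\frac{1}{2}\,\frac{v^2_{\varepsilon,\tilde{\varepsilon}}+\theta_i^2}{v^2_{\varepsilon,\tilde{\varepsilon}}+\tau_i^2}-\frac{1}{2}\,\frac{v^2_{\varepsilon}+\theta_i^2}{v^2_{\varepsilon}+\tau_i^2}\right\}. \tag{10}$$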
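
Each summand of (10) can be cross-checked against a direct computation. The sketch below evaluates, for one coordinate, the closed form in (10) and the expected Kullback–Leibler divergence from $N(\theta_i,\tilde{\varepsilon}^2)$ to the Gaussian predictive distribution in (6), using the standard formula for the divergence between two univariate normal distributions. The function names are hypothetical, and $\tau_i>0$ is assumed.

```python
import math


def risk_closed_form(theta, tau, eps, eps_tilde):
    # One summand of (10)
    v_eps = eps ** 2
    v_joint = 1.0 / (1.0 / eps ** 2 + 1.0 / eps_tilde ** 2)
    t2 = tau ** 2
    return (
        0.5 * math.log((1.0 + t2 / v_joint) / (1.0 + t2 / v_eps))
        + 0.5 * (v_joint + theta ** 2) / (v_joint + t2)
        - 0.5 * (v_eps + theta ** 2) / (v_eps + t2)
    )


def risk_direct(theta, tau, eps, eps_tilde):
    # E_X KL( N(theta, eps_tilde^2) || N(m X, s2) ) with X ~ N(theta, eps^2),
    # where m and s2 are the predictive shrinkage factor and variance from (6).
    m = tau ** 2 / (tau ** 2 + eps ** 2)
    s2 = eps ** 2 * tau ** 2 / (eps ** 2 + tau ** 2) + eps_tilde ** 2
    # Mean squared error of the predictive mean: E (theta - m X)^2
    mse = (1.0 - m) ** 2 * theta ** 2 + m ** 2 * eps ** 2
    return 0.5 * math.log(s2 / eps_tilde ** 2) + (eps_tilde ** 2 + mse) / (2.0 * s2) - 0.5
```

The two functions should agree up to rounding error. When $\theta_i^2=\tau_i^2$, the last two terms of the summand cancel and only the logarithmic term remains, which is the form appearing in Lemma 3.3.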

The second lemma provides the Bayesian predictive distribution that is minimax among the subclass $\{Q_{G_\tau}:\tau\in l_2\}$ of $\mathcal{D}$. The proof is provided in Appendix A.

###### Lemma 3.3.

Assume that . Then, for any and any , is finite and is uniquely determined. Further,

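$$\begin{aligned}\inf_{\tau\in l_2}\sup_{\theta\in\Theta(a,B)}R(\theta,Q_{G_\tau})&=\sup_{\theta\in\Theta(a,B)}\inf_{\tau\in l_2}R(\theta,Q_{G_\tau})\\&=\sup_{\theta\in\Theta(a,B)}R(\theta,Q_{G_{\tau=\tau^{*}(\varepsilon,\tilde{\varepsilon})}})\\&=\sum_{i=1}^{T(\varepsilon,\tilde{\varepsilon})}\frac{1}{2}\log\left(\frac{1+(\tau^{*}_i(\varepsilon,\tilde{\varepsilon}))^2/v^2_{\varepsilon,\tilde{\varepsilon}}}{1+(\tau^{*}_i(\varepsilon,\tilde{\varepsilon}))^2/v^2_{\varepsilon}}\right).\end{aligned}$$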

The third lemma provides the upper bound of the minimax risk.

###### Lemma 3.4.

Assume that . Then, for any and any ,

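$$\inf_{\hat{Q}\in\mathcal{D}}\sup_{\theta\in\Theta(a,B)}R(\theta,\hat{Q})\le\sum_{i=1}^{T(\varepsilon,\tilde{\varepsilon})}\frac{1}{2}\log\left(\frac{1+(\tau^{*}_i(\varepsilon,\tilde{\varepsilon}))^2/v^2_{\varepsilon,\tilde{\varepsilon}}}{1+(\tau^{*}_i(\varepsilon,\tilde{\varepsilon}))^2/v^2_{\varepsilon}}\right).$$
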
###### Proof.

Since the class $\{Q_{G_\tau}:\tau\in l_2\}$ is included in $\mathcal{D}$, the result follows from Lemma 3.3. ∎

We introduce the notations for providing the lower bound of the minimax risk. These notations are also used in Lemma 4.2. Fix an arbitrary positive integer . Let be . Let be . Let and be and , respectively. Let be the -dimensional parameter space defined by

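$$\Theta^{(d)}(a,B):=\left\{\theta^{(d)}=(\theta_1,\ldots,\theta_d):\sum_{i=1}^{d}a_i^2\theta_i^2\le B\right\}.$$

Let $R_d(\theta^{(d)},\hat{Q}^{(d)})$ be the $d$-dimensional Kullback–Leibler risk of a predictive distribution $\hat{Q}^{(d)}$ on $\mathbb{R}^{d}$. Let $R_d(\Theta^{(d)}(a,B))$ be the minimax risk

$$R_d(\Theta^{(d)}(a,B)):=\inf_{\hat{Q}^{(d)}\in\mathcal{D}^{(d)}}\sup_{\theta^{(d)}\in\Theta^{(d)}(a,B)}R_d(\theta^{(d)},\hat{Q}^{(d)}),$$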

where is with the whole set of probability distributions on .

The fourth lemma shows that the minimax risk in the infinite sequence model is bounded below by the minimax risk in the finite dimensional sequence model. The proof is provided in Appendix A.

###### Lemma 3.5.

Let be any positive integer. Then, for any and any ,

$$\inf_{\hat{Q}\in\mathcal{D}}\sup_{\theta\in\Theta(a,B)}R(\theta,\hat{Q})\ge R_d(\Theta^{(d)}(a,B)).$$

The fifth lemma provides the asymptotic minimax risk in a high-dimensional sequence model. It is due to Xu and Liang .

###### Lemma 3.6 (Theorem 4.2 in Xu and Liang ).

Let be defined by (8). Let be defined by (9). Let be where . Assume that . If as and , then

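$$\lim_{\varepsilon\to0}\left[R_{d(\varepsilon)}(\Theta^{(d(\varepsilon))}(a,B))\Big/\sum_{i=1}^{T(\varepsilon,\tilde{\varepsilon})}\frac{1}{2}\log\left(\frac{1+(\tau^{*}_i(\varepsilon,\tilde{\varepsilon}))^2/v^2_{\varepsilon,\tilde{\varepsilon}}}{1+(\tau^{*}_i(\varepsilon,\tilde{\varepsilon}))^2/v^2_{\varepsilon}}\right)\right]=1.$$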

Based on these lemmas, we present the proof of Theorem 3.1.

###### Proof of Theorem 3.1.

From Lemma 3.4,

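$$\inf_{\hat{Q}\in\mathcal{D}}\sup_{\theta\in\Theta(a,B)}R(\theta,\hat{Q})\le\sum_{i=1}^{T(\varepsilon,\tilde{\varepsilon})}\frac{1}{2}\log\left(\frac{1+(\tau^{*}_i(\varepsilon,\tilde{\varepsilon}))^2/v^2_{\varepsilon,\tilde{\varepsilon}}}{1+(\tau^{*}_i(\varepsilon,\tilde{\varepsilon}))^2/v^2_{\varepsilon}}\right).$$

From Lemma 3.5 with $d=d(\varepsilon)$ and Lemma 3.6,

$$\inf_{\hat{Q}\in\mathcal{D}}\sup_{\theta\in\Theta(a,B)}R(\theta,\hat{Q})\ge(1-o(1))\sum_{i=1}^{T(\varepsilon,\tilde{\varepsilon})}\frac{1}{2}\log\left(\frac{1+(\tau^{*}_i(\varepsilon,\tilde{\varepsilon}))^2/v^2_{\varepsilon,\tilde{\varepsilon}}}{1+(\tau^{*}_i(\varepsilon,\tilde{\varepsilon}))^2/v^2_{\varepsilon}}\right).$$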

This completes the proof. ∎

### 3.3 Examples of asymptotically minimax predictive distributions

In this subsection, we provide the asymptotically minimax Kullback–Leibler risks and the asymptotically minimax predictive distributions in the case that $\Theta(a,B)$ is a Sobolev ellipsoid and in the case that it is an exponential ellipsoid.

#### 3.3.1 The Sobolev ellipsoid

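The simplified Sobolev ellipsoid is $\Theta_{\mathrm{Sobolev}}(\alpha,B):=\{\theta\in l_2:\sum_{i=1}^{\infty}i^{2\alpha}\theta_i^2\le B\}$ with $\alpha>0$ and $B>0$. We set $\tilde{\varepsilon}=\gamma\varepsilon$ for $\gamma>0$. This setting is a slight generalization of Section 5 of Xu and Liang, in which the asymptotic minimax Kullback–Leibler risk with $\gamma=1$ is obtained.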

We expand and . From the definition of , we have . Thus, we have

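$$2B=\frac{\varepsilon^2T^{2\alpha+1}}{\gamma^2+1}\left[\int_0^1x^{2\alpha}\sqrt{1+4\gamma^2(\gamma^2+1)x^{-2\alpha}}\,dx-\frac{2\gamma^2+1}{2\alpha+1}\right](1+o(1)),$$

where we use the convergence of the Riemann sum with the function $x^{2\alpha}\sqrt{1+4\gamma^2(\gamma^2+1)x^{-2\alpha}}$. Then,

$$T(\varepsilon,\tilde{\varepsilon})=\left(\frac{B}{\varepsilon^2}\right)^{1/(2\alpha+1)}\left[\frac{2(\gamma^2+1)}{\int_0^1x^{2\alpha}\sqrt{1+4\gamma^2(\gamma^2+1)x^{-2\alpha}}\,dx-\frac{2\gamma^2+1}{2\alpha+1}}\right]^{1/(2\alpha+1)}(1+o(1)) \tag{11}$$

and

$$(\tau^{*}_i(\varepsilon,\tilde{\varepsilon}))^2=\frac{\varepsilon^2}{2}\left[\frac{1}{\gamma^2+1}\sqrt{1+4\gamma^2(\gamma^2+1)\left(\frac{i}{T}\right)^{-2\alpha}}-\frac{2\gamma^2+1}{\gamma^2+1}\right]_{+}(1+o(1)).$$

Figure 1: Convergence constant $\lim_{\varepsilon\to0}\inf_{\hat{Q}\in\mathcal{D}}\sup_{\theta\in\Theta_{\mathrm{Sobolev}}(\alpha,B)}2\varepsilon^{-2/(2\alpha+1)}R(\theta,\hat{Q})$ with $\alpha=1$ and $B=1$. The red line denotes the convergence constant where $\mathcal{A}$ is the whole set of probability distributions, and the black line denotes the convergence constant where $\mathcal{A}$ is the whole set of plug-in predictive distributions.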

Thus, we obtain the asymptotically minimax risk

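$$\begin{aligned}\inf_{\hat{Q}\in\mathcal{D}}\sup_{\theta\in\Theta_{\mathrm{Sobolev}}(\alpha,B)}R(\theta,\hat{Q})&=\sum_{i=1}^{T}\frac{1}{2}\log\left(\frac{1+(\tau^{*}_i(\varepsilon,\tilde{\varepsilon}))^2/v^2_{\varepsilon,\tilde{\varepsilon}}}{1+(\tau^{*}_i(\varepsilon,\tilde{\varepsilon}))^2/v^2_{\varepsilon}}\right)\\&=T\sum_{i=1}^{T}\frac{1}{2}\log\left(1+\frac{1}{\gamma^2+\dfrac{2\gamma^2(\gamma^2+1)}{\sqrt{1+4\gamma^2(\gamma^2+1)(i/T)^{-2\alpha}}-(2\gamma^2+1)}}\right)\frac{1}{T}\\&=\frac{T}{2}\int_0^1\log\left(1+\frac{1}{\gamma^2+\dfrac{2\gamma^2(\gamma^2+1)}{\sqrt{1+4\gamma^2(\gamma^2+1)x^{-2\alpha}}-(2\gamma^2+1)}}\right)dx\,(1+o(1))\\&=\left(\frac{B}{\varepsilon^2}\right)^{1/(2\alpha+1)}P^{*}(1+o(1)),\end{aligned} \tag{12}$$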

where

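$$P^{*}=\frac{1}{2}\left[\frac{2(\gamma^2+1)}{\int_0^1x^{2\alpha}\sqrt{1+4\gamma^2(\gamma^2+1)x^{-2\alpha}}\,dx-\frac{2\gamma^2+1}{2\alpha+1}}\right]^{1/(2\alpha+1)}\times\int_0^1\log\left(1+\frac{1}{\gamma^2+\dfrac{2\gamma^2(\gamma^2+1)}{\sqrt{1+4\gamma^2(\gamma^2+1)x^{-2\alpha}}-(2\gamma^2+1)}}\right)dx.$$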
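
The constant $P^{*}$ has no closed form in general but is easy to evaluate numerically. The following sketch approximates both integrals with a midpoint rule, which never evaluates the integrand at $x=1$, where the inner denominator $\sqrt{1+4\gamma^2(\gamma^2+1)x^{-2\alpha}}-(2\gamma^2+1)$ vanishes. The function name `sobolev_constants` and the grid size are illustrative assumptions, not part of the paper.

```python
import math


def sobolev_constants(alpha, gamma, n_grid=20000):
    # Returns (t_const, p_star), where T is approximately
    # (B / eps^2)^(1 / (2 alpha + 1)) * t_const as in (11), and p_star is P* from (12).
    c = 4.0 * gamma ** 2 * (gamma ** 2 + 1.0)
    shift = 2.0 * gamma ** 2 + 1.0
    h = 1.0 / n_grid
    integral_sqrt = 0.0
    integral_log = 0.0
    for k in range(n_grid):
        x = (k + 0.5) * h  # midpoints lie strictly inside (0, 1)
        root = math.sqrt(1.0 + c * x ** (-2.0 * alpha))
        integral_sqrt += x ** (2.0 * alpha) * root * h
        # root > shift for x < 1, so the inner denominator is positive
        inner = gamma ** 2 + 2.0 * gamma ** 2 * (gamma ** 2 + 1.0) / (root - shift)
        integral_log += math.log(1.0 + 1.0 / inner) * h
    bracket = integral_sqrt - shift / (2.0 * alpha + 1.0)
    t_const = (2.0 * (gamma ** 2 + 1.0) / bracket) ** (1.0 / (2.0 * alpha + 1.0))
    return t_const, 0.5 * t_const * integral_log
```

For $\alpha=1$ and $\gamma=1$, the first integral equals $(27-8^{3/2})/3$ exactly, which gives a simple check of the quadrature. Evaluating `sobolev_constants(1.0, gamma)[1]` over a range of `gamma` traces how $P^{*}$ varies with the ratio $\gamma=\tilde{\varepsilon}/\varepsilon$.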

We compare the Kullback–Leibler risk of the asymptotically minimax predictive distribution with that of the plug-in predictive distribution that is asymptotically minimax among all plug-in predictive distributions. The latter is obtained using Pinsker’s asymptotically minimax theorem for estimation (see Pinsker). We call the former and the latter risks the predictive and the estimative asymptotically minimax risks, respectively. The orders of $\varepsilon^{-2}$ and $B$ in the predictive asymptotically minimax risk are both the $1/(2\alpha+1)$-th power. These orders are the same as in the estimative asymptotically minimax risk. However, the convergence constant $P^{*}$ and the convergence constant in the estimative asymptotically minimax risk are different. Note that the convergence constant in the estimative asymptotically minimax risk is the Pinsker constant multiplied by $1/(2\gamma^2)$. Figure 1 shows that the convergence constant becomes smaller than the convergence constant in the estimative asymptotically minimax risk as increases. Xu and Liang also pointed out this phenomenon when .

#### 3.3.2 The exponential ellipsoid

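The exponential ellipsoid is $\Theta_{\exp}(\alpha,B):=\{\theta\in l_2:\sum_{i=1}^{\infty}e^{2\alpha i}\theta_i^2\le B\}$, with $\alpha>0$ and $B>0$. We set $\tilde{\varepsilon}=\gamma\varepsilon$ for $\gamma>0$.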

We expand and . From the definition of , we have . Thus,

 2B = Ne2αTε2(γ2+1)r(α,γ)(1+o(1)),

where is a bounded term with respect to