
# Beyond Binomial and Negative Binomial: Adaptation in Bernoulli Parameter Estimation

*This material is based upon work supported in part by the US National Science Foundation under Grant No. 1422034 and Grant No. 1815896, and by the DARPA REVEAL program under Contract No. HR0011-16-C-0030. The authors are with the Department of Electrical and Computer Engineering, Boston University, Boston, MA 02215 USA.*

Safa C. Medin, John Murray-Bruce, David Castañón, and Vivek K Goyal
###### Abstract

Estimating the parameter of a Bernoulli process arises in many applications, including photon-efficient active imaging where each illumination period is regarded as a single Bernoulli trial. Motivated by acquisition efficiency when multiple Bernoulli processes are of interest, we formulate the allocation of trials under a constraint on the mean as an optimal resource allocation problem. An oracle-aided trial allocation demonstrates that there can be a significant advantage from varying the allocation for different processes and inspires a simple trial allocation gain quantity. Motivated by realizing this gain without an oracle, we present a trellis-based framework for representing and optimizing stopping rules. Considering the convenient case of Beta priors, three implementable stopping rules with similar performances are explored, and the simplest of these is shown to asymptotically achieve the oracle-aided trial allocation. These approaches are further extended to estimating functions of a Bernoulli parameter. In simulations inspired by realistic active imaging scenarios, we demonstrate significant mean-squared error improvements up to 4.36 dB for the estimation of $p$ and up to 1.80 dB for the estimation of a function of $p$.

adaptive sensing, Bernoulli processes, beta distribution, coding gain, computational imaging, conjugate prior, dynamic programming, greedy algorithm, low-light imaging, photon counting, total-variation regularization

## I Introduction

Estimating the parameter of a Bernoulli process is a fundamental problem in statistics and signal processing. From the binary-valued outcomes of independent and identically distributed (i.i.d.) trials (generically failure (0) or success (1)), we wish to estimate the probability of success $p$. Among myriad applications, our primary interest is raster-scanned active imaging in which a scene patch is periodically illuminated with a pulse, and each illumination period (Bernoulli trial) either has a photon-detection event (success) or not (failure) [1]. The probability $p$ of a photon-detection event has a monotonic relationship with the reflectivity of the scene patch, and a monotonic function of an estimate of $p$ becomes the corresponding image pixel value. For efficiency in acquisition time or illumination energy, we are motivated to form the image accurately from a small number of illumination pulses, under conditions where $p$ is small. (For applications using time-correlated single photon counting driven by a detector with dead time, such as a single-photon avalanche diode (SPAD), it is recommended to keep $p$ low to avoid time skew and missed detections [2].) Other types of raster-scanned imaging can be modeled similarly assuming that the dwell time is an integer multiple of some base time interval, during which the observations are binary.

Conventional methods are not adaptive. With a fixed number of trials $n$, the number of successes $K$ is a binomial random variable, and the maximum likelihood (ML) estimate of $p$ is $K/n$. Though less common in active imaging, a well-known alternative in the statistics literature is to fix a number of successes $\ell$. Repeating trials until the $\ell$th success occurs results in a random number of trials $N$ that is a negative binomial random variable (note that the negative binomial distribution is defined inconsistently in the literature, with sometimes the number of failures being fixed rather than the number of successes), and the ML estimate of $p$ is $\ell/N$. While there may seem to be nothing to design here, a constraint on the mean number of trials opens up possibilities for data acquisition that results in neither binomial nor negative binomial distributions. The mean may be over a multiplicity of (non-random) Bernoulli process parameters to estimate (such as in active imaging, with one parameter per pixel) or over a prior for a single Bernoulli parameter. The two cases are formally linked through the relative frequency interpretation of probability, with the empirical distribution of the multiplicity of deterministic parameters in the former case playing the role of the prior distribution in the latter case [3]. For multiple deterministic parameters, we have a resource allocation problem reminiscent of bit allocation in transform coding [4, 5]. As we will demonstrate, in an oracle-aided setting, trials can be allocated to maximize a trial allocation gain that is analogous to the coding gain of transform coding. For a single random parameter, a simple and implementable approach – not requiring an oracle – asymptotically achieves the optimal trial allocation gain and may perform better than the oracle-aided method for moderate numbers of trials.

The focus of this paper is on allocating trials in the estimation of a single random parameter through the design of a stopping rule. A stopping rule may – implicitly and stochastically – allocate trials differently for different values of $p$, even though $p$ is not known a priori. We show that any optimal stopping rule can be described by a trellis rather than a more complicated graph, and greedy construction of the trellis is very nearly optimal. For a rectangular array of Bernoulli processes representing a scene in an imaging problem, applying a good stopping rule allocates more trials to the pixels where they provide the most benefit. The final image formation may include a method such as total variation (TV) regularization for exploiting spatial correlations among neighbors. In simulations with parameters realistic for active optical imaging, we demonstrate a reduction in mean-squared error (MSE) by a factor of 2.67 (4.26 dB) in comparison to the same regularized reconstruction approach applied without adaptation in numbers of trials. Such gains vary based on image content, and gains without regularization are predictable from the trial allocation gain formulation.

### I-A Related Work

#### I-A1 Statistics Literature

In statistics, forming a parameter estimate from a number of i.i.d. observations that is dependent on the observations themselves is called sequential estimation [6]. Early interest in sequential estimation of a Bernoulli process parameter was inspired by the high relative error of deterministically stopping after $n$ trials when $p$ is small. Specifically, the standard error of the ML estimate is $\sqrt{p(1-p)/n}$, which for small $p$ is unfavorable compared to anything proportional to $p$. This shortcoming manifests, for example, in requiring large $n$ to distinguish between two small possible values for $p$.

Haldane [7] observed that if one stops after $\ell$ successes, the (random) number of trials $N$ is informative about $p$. Then $(\ell-1)/(N-1)$ is an unbiased estimate of $p$ (provided $\ell \ge 2$), and its standard error is approximately proportional to $p$ (provided $\ell \ge 3$). (The ML estimate $\ell/N$ is not unbiased, though $N/\ell$ is an unbiased estimate of $1/p$.) Tweedie [8] suggested to call this inverse binomial sampling, but the resulting random variable is now commonly known as negative binomial or Pascal distributed. More recent works have focused on non-MSE performance metrics [9, 10], estimation of functions of $p$ [11], estimation from imperfect observations [12], and composite hypothesis testing [13].
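Haldane's scheme is straightforward to check by simulation. The sketch below (our own illustration with hypothetical function names, not code from the paper) draws inverse-binomial trial counts and confirms that $(\ell-1)/(N-1)$ is nearly unbiased for a small $p$:

```python
import random

def inverse_binomial_trial_count(p, num_successes, rng):
    """Run Bernoulli(p) trials until num_successes successes occur; return N."""
    successes, n = 0, 0
    while successes < num_successes:
        n += 1
        if rng.random() < p:
            successes += 1
    return n

def haldane_estimate(ell, n):
    """Haldane's unbiased estimate (ell - 1)/(N - 1), valid for ell >= 2."""
    return (ell - 1) / (n - 1)

rng = random.Random(0)
p, ell, runs = 0.05, 5, 5000
mean_est = sum(haldane_estimate(ell, inverse_binomial_trial_count(p, ell, rng))
               for _ in range(runs)) / runs
print(mean_est)  # close to p = 0.05
```

Averaging the ML estimate $\ell/N$ in the same way would show a visible upward bias for small $\ell$, which is exactly what Haldane's correction removes.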

#### I-A2 Photon-Efficient Imaging and Variable Dwell Time

First-photon imaging [14] introduced sequential estimation to active imaging. This method uses the number of illumination pulses until the first photon is detected to reveal information about reflectivity, setting $\ell = 1$ in the concept of Haldane [7] and thus using geometric sampling as a special case of negative binomial sampling. A censoring method is used to approximately separate signal and background detections, and spatial correlations are used to regularize the estimation of the full scene reflectivity image, resulting in good performance from only 1 detected photon per pixel, even when half of the detected photons are attributable to uninformative ambient light. Subsequent work with binomial sampling (and otherwise identical experimental conditions) resulted in similar performance [1], and greatly increased robustness to ambient light is largely attributable to improving the censoring step [15]. These works leave questions on the importance of negative binomial sampling to first-photon imaging unanswered; comparing first-photon imaging to photon-efficient methods with deterministic dwell time [16, 17, 1, 18, 19, 20, 21, 22, 23, 15, 24, 25] was an initial inspiration for the present work.

While recent works have exploited the first-photon idea in imaging techniques such as ghost imaging [26, 27] and x-ray tomography [28], previous uses of variable dwell time are not closely connected to sequential estimation or the result of optimized resource allocation. For example, in LIDAR, varying dwell time to maintain approximately constant signal strength despite varying effective reflectivity (including greater radial fall-off for more distant scene patches) dates back to at least the 1970s [29]. He et al. [30] closely follow the technique of [1], including its background censoring, and vary the dwell time to keep the number of photon detections after censoring (i.e., photon detections attributed to signal rather than background) at each pixel approximately constant. In scanning electron microscopy, Dahmen et al. [31] increase dwell time where a measure of image detail is large. To the best of our knowledge, no previous paper has formally optimized dwell time under a Bernoulli process measurement model.

### I-B Main Contributions and Preview of Results

#### I-B1 Framework

This work discusses a novel framework for depicting and understanding stopping rules for sequential estimation of Bernoulli parameters under constraints on the number of trials (Section III). In this framework, first presented in [32], each Bernoulli trial corresponds to a transition in a trellis in which each node is identified by the number of trials and number of successes; it is easily shown that distinct paths to reach a given node need not be distinguished. A stopping rule is the assignment of probabilities of stopping to each node in the trellis (see Figs. 4–6). By construction, a stopping rule defined in this way is implementable because it does not depend on knowledge of $p$ or non-causally on the Bernoulli process.

#### I-B2 Stopping Rule Design

Simple stopping rules lead to binomial (Fig. 5(a)) and negative binomial (Fig. 5(b)) sampling. Methods to optimize the stopping rule are presented in decreasing order of computational complexity: dynamic programming (Section IV-A), offline greedy design (Section IV-B), and online threshold-based termination (Section IV-C) first introduced in [32]. Empirically, all three methods, including the online method requiring no storage of a precomputed stopping rule, provide very similar performance. Thus, the easily implementable online method provides very nearly optimal performance.

#### I-B3 Analysis in Oracle-Aided Setting

This paper introduces the concept of oracle-aided trial allocation whereby processes with different parameters are allocated different fractions of an overall trial budget (Section II-A). This yields a readily-computed trial allocation gain that can be arbitrarily large, though it is generally modest (Section II-B). Furthermore, we show that the threshold-based stopping asymptotically allocates trials identically to the oracle-based optimal (Section IV-E).

#### I-B4 Evaluation

In simulations inspired by realistic active imaging scenarios, an MSE improvement factor of 2.67 (4.26 dB) is demonstrated when spatial correlations are exploited through total variation (TV) regularization (Section VI-A). Without TV regularization, the measured gain is close to the gain predicted by the theoretical trial allocation gain.

#### I-B5 Estimating functions of p

Inspired by applications where estimating functions of $p$ is of interest [11], online threshold-based termination is also extended to estimating a function of $p$ from Bernoulli observations (Section V). Experimental results without TV regularization demonstrate an improvement of up to 1.80 dB using the threshold-based stopping rule versus conventional binomial sampling (Section VI-B).

## II Trial Allocation Across Multiple Processes

Consider the estimation of the parameters $p_1, p_2, \dots, p_r$ of a finite number $r$ of Bernoulli processes with binomial sampling of each process. When $m_i$ trials of process $i$ are observed, the MSE of the ML estimate of $p_i$ is $p_i(1-p_i)/m_i$. Suppose that we are interested in making the average of the MSEs,

$$\frac{1}{r}\sum_{i=1}^{r}\frac{p_i(1-p_i)}{m_i},$$

small under a constraint $\frac{1}{r}\sum_{i=1}^{r} m_i \le \eta$ on the average of the numbers of trials.

Since $p_i(1-p_i)$ varies with $p_i$, there can be an advantage to varying the $m_i$ values. However, that allocation of trials depends on parameters that are to be estimated. In this section, we suspend the need for implementability and instead study the optimal trial allocation as if the parameters were known. This provides a benchmark for the implementable methods developed in the remainder of the paper, with the empirical distribution of $\{p_i\}$ playing the role of a discrete prior on $p$. We also consider $r \to \infty$ to reach a distributional limit.

### II-A Oracle-Aided Optimal Allocation

In optimizations such as

$$\min_{m_i,\ i=1,2,\dots,r}\ \sum_{i=1}^{r}\frac{p_i(1-p_i)}{m_i} \quad \text{s.t.} \quad \sum_{i=1}^{r} m_i \le r\eta, \qquad (1)$$

ignoring that each $m_i$ should be a positive integer, each MSE vs. number of trials trade-off should be at the same slope; otherwise it would be advantageous to shift trial resources to the process for which the benefit (MSE reduction) per trial is largest. This is formalized using the method of Lagrange multipliers. The resulting optimal allocation is

$$m_i^* = r\eta\,\frac{\sqrt{p_i(1-p_i)}}{\sum_{j=1}^{r}\sqrt{p_j(1-p_j)}}, \qquad i = 1, 2, \dots, r. \qquad (2)$$

Since each process has a fixed number of trials $m_i^*$, independent of the experimental outcome of each trial, we call using these numbers of trials oracle-aided binomial sampling.
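The allocation (2) is a one-line computation. The sketch below (our own illustration; the function name is ours) normalizes the weights $\sqrt{p_i(1-p_i)}$ so that the average number of trials equals the budget $\eta$:

```python
import math

def oracle_allocation(ps, eta):
    """Oracle-aided allocation (2): trials proportional to sqrt(p_i(1 - p_i)),
    normalized so that the average number of trials equals eta."""
    weights = [math.sqrt(p * (1 - p)) for p in ps]
    total = sum(weights)
    return [len(ps) * eta * w / total for w in weights]

alloc = oracle_allocation([0.01, 0.5], eta=100)
print(alloc)  # the p = 0.5 process receives far more trials
```

Note that the allocations sum to $r\eta$ by construction, so the mean-trials constraint in (1) is met with equality.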

###### Example 1 (Oracle-aided allocations)
1. Let $r = 2$ and $(p_1, p_2) = (\varepsilon, 1/2)$. Then the fractional oracle-aided allocations are

$$\frac{m_1^*}{2\eta} = \frac{\sqrt{\varepsilon(1-\varepsilon)}}{\sqrt{\varepsilon(1-\varepsilon)}+1/2}, \qquad \frac{m_2^*}{2\eta} = \frac{1/2}{\sqrt{\varepsilon(1-\varepsilon)}+1/2}.$$

These are plotted as functions of $\varepsilon$ in Fig. 1(a).

2. Let $p_i = \varepsilon$ for $i = 1, 2, \dots, r-1$ and $p_r = 1/2$. Then the fractional oracle-aided allocations are

$$\frac{m_i^*}{r\eta} = \frac{\sqrt{\varepsilon(1-\varepsilon)}}{(r-1)\sqrt{\varepsilon(1-\varepsilon)}+1/2}, \quad i = 1, 2, \dots, r-1, \qquad \frac{m_r^*}{r\eta} = \frac{1/2}{(r-1)\sqrt{\varepsilon(1-\varepsilon)}+1/2}.$$
3. For a third choice of $r$ and $\{p_i\}$, the fractional oracle-aided allocations are plotted in Fig. 1(b).

### II-B Trial Allocation Gain

Using the oracle-aided allocations (2) reduces the average MSE relative to a constant allocation $m_i = \eta$. The constant allocation results in the average MSE

$$\frac{1}{r}\sum_{i=1}^{r}\frac{p_i(1-p_i)}{\eta}, \qquad (3)$$

whereas using (2) yields

$$\frac{1}{r}\sum_{i=1}^{r}\frac{p_i(1-p_i)}{m_i^*} = \frac{1}{r}\sum_{i=1}^{r}\frac{p_i(1-p_i)}{\sqrt{p_i(1-p_i)}}\,\frac{1}{r\eta}\sum_{j=1}^{r}\sqrt{p_j(1-p_j)} = \frac{1}{r^2\eta}\left(\sum_{j=1}^{r}\sqrt{p_j(1-p_j)}\right)^{2}. \qquad (4)$$

We define the ratio of (3) and (4) as the trial allocation gain:

$$\gamma_{\mathrm{alloc}} = \frac{r\sum_{i=1}^{r}p_i(1-p_i)}{\left(\sum_{j=1}^{r}\sqrt{p_j(1-p_j)}\right)^{2}}. \qquad (5)$$

Trial allocation gain is reminiscent of the coding gain in transform coding [4, 5].

###### Example 2 (Trial allocation gains)
1. For the parameters in Example 1(a), the trial allocation gain is plotted as a function of $\varepsilon$ in Fig. 2. Notice that in the limit of $\varepsilon \to 0$, all the trials are allocated to the nontrivial Bernoulli process, doubling its number of trials, which halves the average MSE. Thus $\gamma_{\mathrm{alloc}} \to 2$.

2. For the parameters in Example 1(b), $\gamma_{\mathrm{alloc}} \to r$ in the limit of $\varepsilon \to 0$.

3. For the parameters in Example 1(c), the trial allocation gain is plotted as a function of $r$ in Fig. 2(b).

4. Fig. 3(a) shows the “Modified Shepp–Logan phantom” provided by the Matlab `phantom` command, suitably resized and rescaled. Fig. 3(b) shows a histogram of the intensity values of the phantom. Evaluating (5) gives 1.6944, or 2.29 dB.

One can show that $1 \le \gamma_{\mathrm{alloc}} \le r$. The upper bound is illustrated in part (b) of the example.
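The gain (5) and its bounds are easy to check numerically; a minimal sketch (our own, with a hypothetical function name):

```python
import math

def allocation_gain(ps):
    """Trial allocation gain (5): ratio of the uniform-allocation average MSE
    (3) to the oracle-allocation average MSE (4)."""
    r = len(ps)
    num = r * sum(p * (1 - p) for p in ps)
    den = sum(math.sqrt(p * (1 - p)) for p in ps) ** 2
    return num / den

gain_equal = allocation_gain([0.3, 0.3, 0.3])   # equal parameters: gain is 1
gain_mixed = allocation_gain([1e-6, 0.5])       # Example 2(a) with small eps
print(gain_equal, gain_mixed)
```

With equal parameters the gain is exactly 1 (the lower bound), and for the two-process example the gain approaches 2 (the upper bound for $r = 2$) as $\varepsilon \to 0$.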

### II-C Distributional Limit

Suppose now that the Bernoulli process parameter is modeled with random variable $P$ and the number of trials $M$ is to be assigned by an oracle (i.e., it is allowed to depend on the realization of $P$) to minimize the MSE of the ML estimate of $P$ under the constraint $\mathrm{E}[M] \le \eta$. By analogy to the computations giving (2) – or formally taking a limit as $r \to \infty$ with the empirical distribution of $\{p_i\}$ converging to the distribution of $P$ – the number of trials should be assigned based on how large $\sqrt{P(1-P)}$ is relative to its mean:

$$M = \eta\,\frac{\sqrt{P(1-P)}}{\mathrm{E}\!\left[\sqrt{P(1-P)}\right]}. \qquad (6)$$

The resulting trial allocation gain is

$$\gamma_{\mathrm{alloc}} = \frac{\mathrm{E}[P(1-P)]}{\left(\mathrm{E}\!\left[\sqrt{P(1-P)}\right]\right)^{2}}. \qquad (7)$$

This can also be written as

$$\gamma_{\mathrm{alloc}} = 1 + \frac{\mathrm{var}(V)}{\left(\mathrm{E}[V]\right)^{2}}, \qquad (8)$$

where $V = \sqrt{P(1-P)}$. It follows that $\gamma_{\mathrm{alloc}} \ge 1$, with equality if and only if the random variable $V$ has zero variance.

###### Example 3 (Trial allocation gains – random parameter)
1. Let $P$ have the continuous uniform distribution on $[0,1]$. Then evaluating (7) gives $\gamma_{\mathrm{alloc}} = (1/6)/(\pi/8)^2 = 32/(3\pi^2) \approx 1.08$. This value is the asymptote in Fig. 2(b).

2. Let $P$ take two values: $0$ with probability $1-\delta$ and $1/2$ with probability $\delta$. Then $\mathrm{E}[P(1-P)] = \delta/4$ and $\mathrm{E}[\sqrt{P(1-P)}] = \delta/2$. Substituting in (7) gives $\gamma_{\mathrm{alloc}} = 1/\delta$. We can interpret this with relative frequencies: since $P = 0$ requires no trials, fraction $\delta$ of the time $P = 1/2$ will occur and should be allocated $1/\delta$ times the mean number of trials.

3. When $P \ll 1$ holds, $V = \sqrt{P(1-P)} \approx \sqrt{P}$. Therefore, (8) becomes approximately invariant to rescaling of $P$. For example, rescaling the phantom in Example 2(d) to smaller maximum intensities gives only small changes in $\gamma_{\mathrm{alloc}}$ from the value in Example 2(d).

The first two parts of the example show that though an allocation gain may typically be modest, it may also be arbitrarily large. The third part shows that allocation gain is approximately dependent on the coefficient of variation of the Bernoulli parameter, provided that the parameter is known to be small.
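The uniform-prior value from Example 3(a) can be obtained in closed form and cross-checked by simulation. The sketch below is our own evaluation of (7); the identity $\mathrm{E}[\sqrt{P(1-P)}] = \pi/8$ holds because $\sqrt{p(1-p)}$ traces a semicircle of radius $1/2$ over $[0,1]$:

```python
import math
import random

# Closed-form evaluation of (7) for P ~ Uniform(0, 1):
# E[P(1 - P)] = 1/6 and E[sqrt(P(1 - P))] = pi/8 (semicircle area).
gamma_closed = (1 / 6) / (math.pi / 8) ** 2   # = 32/(3*pi^2), about 1.08

# Monte Carlo cross-check of (7)
rng = random.Random(1)
samples = [rng.random() for _ in range(100000)]
num = sum(p * (1 - p) for p in samples) / len(samples)
den = (sum(math.sqrt(p * (1 - p)) for p in samples) / len(samples)) ** 2
gamma_mc = num / den
print(gamma_closed, gamma_mc)
```

The two values agree to within Monte Carlo error, confirming that the distributional-limit gain for a uniform prior is modest.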

Having established that varying the numbers of trials can be beneficial, we now turn our attention to methods that do not depend on an oracle. We will compare to the oracle-aided allocations in certain asymptotic settings.

## III Observation of a Single Bernoulli Process

Let $X_1, X_2, \dots$ be a Bernoulli process with an unknown random parameter $P$, and let $\eta$ be a trial budget. A stopping rule consists of a sequence of continuation probability functions

$$\pi_n : \{0,1\}^n \to [0,1], \qquad n = 0, 1, \dots, \qquad (9)$$

that give the probability of continuing observations after trial $n$ – based on a biased coin flip independent of the Bernoulli process – as a function of $(X_1, X_2, \dots, X_n)$. The result is a random number of observed trials $N$. (The time $N$ does not satisfy the standard definition of a stopping time when the stopping rule is randomized because randomness independent of the sequence of outcomes is allowed to influence the decision of whether or not to continue observations.) The stopping rule is said to satisfy the trial budget when $\mathrm{E}[N] \le \eta$. It is said to be deterministic when every $\pi_n$ takes values only in $\{0,1\}$, and it is said to be randomized otherwise. A randomized stopping rule can be seen as stochastic multiplexing among some number of deterministic stopping rules.

Our goal is to minimize the MSE in estimation of $P$ through the design of a stopping rule that satisfies the trial budget and of an estimator $\hat{p}(X_1, \dots, X_N)$. We will first show that the continuation probability functions can be simplified greatly with no loss of optimality. Then, we will provide results on optimizing the stopping rule under a Beta prior on $P$.

### III-A Framework for Data-Dependent Stopping

Based on (9), a natural representation of a stopping rule is a node-labeled binary tree representing all sample paths of the Bernoulli process, with a probability of continuation label at each node. This representation has $2^{d+1}-1$ labels for observation sequences up to length $d$. However, the tree can be simplified to a trellis without loss of optimality. Conditioned on observing $k$ successes in $m$ trials, all sequences of length $m$ with $k$ successes are equally likely. Thus, regardless of the prior on $P$, no improvement can come from having unequal continuation probabilities for tree nodes all representing $k$ successes in $m$ trials. Instead, these nodes should be combined, thereby reducing the tree to a trellis. This representation has $(d+1)(d+2)/2$ labels for observation sequences up to length $d$. The continuation probability functions are reduced to a set of probabilities $q_{k,m}$ for continuing after $k$ successes in $m$ trials, as depicted in Fig. 4.

Trellises alone – without labels – give a simple representation for both data-dependent and data-independent deterministic stopping rules. Hence, we begin with some related terminology that will be used throughout this paper.

###### Definition 1 (Complete trellis)

A complete trellis of depth $d$ contains all nodes belonging to the set

$$\mathcal{T}_d = \{(k,m) : k = 0, 1, \dots, m;\ m = 0, 1, \dots, d\}. \qquad (10)$$
###### Definition 2 (Strategy)

Any $T \subseteq \mathcal{T}_d$ is a strategy when all nodes in $T$ are connected and $T$ contains the root node $(0,0)$.

Henceforth, we restrict our attention to strategies and stochastic multiplexing among strategies. The stopping rule prescribed by the strategy $T$ is

$$q_{k,m}(T) = \begin{cases} 1, & v = (k,m) \in T; \\ 0, & \text{otherwise}. \end{cases} \qquad (11)$$

### III-B Standard Sampling Methods and their Representations

The conventional use of a fixed number of trials $n$ corresponds to continuation probabilities

$$q_{k,m} = \begin{cases} 1, & m < n; \\ 0, & \text{otherwise}. \end{cases} \qquad (12)$$

Regardless of the sample path, one observes exactly $n$ trials, and the number of successes $K$ is a random variable. We refer to this as binomial sampling or the binomial stopping rule. An example of the corresponding trellis representation for a fixed number of trials is shown in Fig. 5(a).

The technique analyzed by Haldane [7] and employed in first-photon imaging [14] with $\ell = 1$ can be expressed with continuation probabilities

$$q_{k,m} = \begin{cases} 1, & k < \ell; \\ 0, & \text{otherwise}. \end{cases} \qquad (13)$$

Observations cease with $\ell$ successes in $N$ trials, where $N$ is a random variable. We call such a strategy the negative binomial stopping rule, or the geometric stopping rule for the special case $\ell = 1$. The trellis representation of the negative binomial stopping rule is shown in Fig. 5(b).

In general, observations cease with $K$ successes in $N$ trials, where $K$ and $N$ are both random variables. Importantly, the i.i.d. nature of a Bernoulli process makes the pair $(K, N)$ contain all the information that is relevant from the sequence of observations. Similarly to the reduction from tree to trellis, conditioned on $(K, N) = (k, m)$, all sequences of length $m$ with $k$ successes are equally likely, so the specific sequence among these is uninformative about $P$.
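Both standard rules can be written directly as continuation probabilities and simulated. A minimal sketch (our own, with hypothetical function names) shows that only the terminal pair $(k, m)$ needs to be tracked:

```python
import random

def run_stopping_rule(p, q, rng, max_trials=100000):
    """Observe a Bernoulli(p) process, continuing while the deterministic
    continuation rule q(k, m) is true; return the terminal pair (k, m)."""
    k = m = 0
    while q(k, m):
        if m >= max_trials:
            break
        m += 1
        k += rng.random() < p
    return k, m

rng = random.Random(2)

# Binomial stopping rule (12): always exactly n trials
n = 50
k1, m1 = run_stopping_rule(0.2, lambda k, m: m < n, rng)

# Negative binomial stopping rule (13): stop at the ell-th success
ell = 3
k2, m2 = run_stopping_rule(0.2, lambda k, m: k < ell, rng)

print((k1, m1), (k2, m2))
```

For the binomial rule the trial count is deterministic and the success count is random; for the negative binomial rule the roles are reversed.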

### III-C Analysis Under Beta Prior

Our method for optimizing the design of continuation probabilities is through analyzing the mean Bayes risk reduction from continuation. We define the risk function as squared error (squared loss)

$$L(p, \hat{p}) = (p - \hat{p})^2,$$

where $p$ is the unknown Bernoulli parameter and $\hat{p}$ is the estimate of this parameter. The Bayes risk is defined as

$$R(\hat{p}) = \mathrm{E}[L(P, \hat{p})] = \mathrm{E}[(P - \hat{p})^2],$$

which in this case is the MSE. Using the minimum MSE (MMSE) estimator, for which $\hat{p} = \mathrm{E}[P \mid X_1, \dots, X_N]$, the Bayes risk is the variance of the posterior distribution. Thus, key to the optimization is to track posterior variances through the trellis. While this could be done for any prior on $P$, here we consider only the mathematically convenient case of choosing a conjugate prior.

#### III-C1 Beta Prior

The Beta distribution is the conjugate prior for the Bernoulli, binomial, and negative binomial distributions. When $P$ has the $\mathrm{Beta}(a,b)$ distribution with probability density function

$$f_P(p; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, p^{a-1} (1-p)^{b-1},$$

where $\Gamma(\cdot)$ is the gamma function, the posterior distribution after observing $k$ successes in $m$ trials is the $\mathrm{Beta}(a+k,\, b+m-k)$ distribution. The beta distribution has mean

$$\mu_{a,b} = \mathrm{E}[P] = \frac{a}{a+b} \qquad (14)$$

and variance

$$\sigma^2_{a,b} = \mathrm{var}(P) = \frac{ab}{(a+b)^2(a+b+1)}. \qquad (15)$$

#### III-C2 Expected Number of Trials

For the stopping rule represented by the trellis $T$, the expected number of trials is the weighted sum of the depths of all stopping (or leaf) nodes, with weights corresponding to the probability of reaching each node under the initial prior. For a trellis $T$ and initial $\mathrm{Beta}(\alpha, \beta)$ prior, the probability $u_{k,m}(T)$ of reaching any node $(k,m)$ can be expressed recursively using the probabilities of reaching its parents, $(k-1, m-1)$ and $(k, m-1)$. Conditioned on reaching $(k-1, m-1)$, the probability of reaching $(k,m)$ is the product of continuation probability $q_{k-1,m-1}(T)$ and success probability

$$\mu_{\alpha+k-1,\,\beta+m-k} = \frac{\alpha+k-1}{\alpha+\beta+m-1}; \qquad (16a)$$

similarly, conditioned on reaching $(k, m-1)$, the probability of reaching $(k,m)$ is the product of continuation probability $q_{k,m-1}(T)$ and failure probability

$$1 - \mu_{\alpha+k,\,\beta+m-k-1} = \frac{\beta+m-k-1}{\alpha+\beta+m-1}. \qquad (16b)$$

Hence, we have the recursion

$$u_{k,m}(T) = u_{k-1,m-1}(T)\, q_{k-1,m-1}(T)\, \frac{\alpha+k-1}{\alpha+\beta+m-1} + u_{k,m-1}(T)\, q_{k,m-1}(T)\, \frac{\beta+m-k-1}{\alpha+\beta+m-1} \qquad (17)$$

for the probability of reaching node $(k,m)$, where any term with an out-of-range index is taken to be zero. The recursion is initialized with $u_{0,0}(T) = 1$. Since $T \subseteq \mathcal{T}_d$, it suffices to compute $u_{k,m}(T)$ up to depth $d+1$.

Using from (17), the expected number of trials incurred by a strategy starting with a prior is

$$h_{\alpha,\beta}(T) = \sum_{v = (k,m) \in \mathcal{T}_{d+1} \setminus T} m\, u_{k,m}(T). \qquad (18)$$

The nonzero terms in the sum correspond to the reachable leaf nodes, which are all contained in $\mathcal{T}_{d+1} \setminus T$.

#### III-C3 Expected Bayes Risk

Under initial prior $\mathrm{Beta}(\alpha, \beta)$, the Bayes risk of the MMSE estimate of $P$ from observations leading to node $(k,m)$ is given by (15), with $a = \alpha + k$ and $b = \beta + m - k$. A strategy has expected Bayes risk given by the sum of the Bayes risks of nodes with zero continuation probability, weighted by the probabilities of reaching those nodes:

$$g_{\alpha,\beta}(T) = \sum_{v = (k,m) \in T'} u_{k,m}(T)\, \sigma^2_{\alpha+k,\,\beta+m-k} = \sum_{v = (k,m) \in T'} u_{k,m}(T)\, \frac{(\alpha+k)(\beta+m-k)}{(\alpha+\beta+m)^2(\alpha+\beta+m+1)}, \qquad (19)$$

where $T'$ denotes the set of reachable nodes with zero continuation probability.

#### III-C4 Optimization Problem Statement

With the proposed trellis-based framework, finding an optimal deterministic stopping rule (in the MSE sense) under an average budget constraint becomes a set minimization problem:

$$T^* = \operatorname*{arg\,min}_{T \in 2^{\mathcal{T}_d}}\ g_{\alpha,\beta}(T) \quad \text{subject to} \quad h_{\alpha,\beta}(T) \le \eta. \qquad (20)$$

Implementable solutions to (20), with varying complexities and deviations from optimality, are presented in the subsequent section. We seek only solutions on the lower convex hull of the trade-off between $h_{\alpha,\beta}(T)$ and $g_{\alpha,\beta}(T)$. Stochastic multiplexing among these solutions gives optimal randomized stopping rules.
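The quantities (17)–(19) can be evaluated exactly with rational arithmetic. The sketch below (our own helper, not code from the paper) does so for any deterministic continuation rule; for binomial sampling with $n$ trials under a uniform prior, one can work out that the expected Bayes risk is $1/(6(n+2))$, which the script confirms:

```python
from fractions import Fraction

def trellis_stats(q, d, alpha=1, beta=1):
    """Evaluate reach probabilities (17), expected trials (18), and expected
    Bayes risk (19) for a deterministic continuation rule q(k, m) on a
    depth-d complete trellis, under a Beta(alpha, beta) prior."""
    u = {(0, 0): Fraction(1)}
    h = Fraction(0)   # expected number of trials
    g = Fraction(0)   # expected Bayes risk
    for m in range(d + 2):
        for k in range(m + 1):
            uk = u.get((k, m), Fraction(0))
            if uk == 0:
                continue
            if m <= d and q(k, m):
                # continue: split the reach probability per (16a)-(16b)
                ps = Fraction(alpha + k, alpha + beta + m)
                u[(k + 1, m + 1)] = u.get((k + 1, m + 1), Fraction(0)) + uk * ps
                u[(k, m + 1)] = u.get((k, m + 1), Fraction(0)) + uk * (1 - ps)
            else:
                # stop: accumulate depth for (18), posterior variance (15) for (19)
                a, b = alpha + k, beta + m - k
                h += uk * m
                g += uk * Fraction(a * b, (a + b) ** 2 * (a + b + 1))
    return h, g

# Binomial sampling with n trials, uniform prior: h = n and g = 1/(6(n+2))
n = 10
h, g = trellis_stats(lambda k, m: m < n, d=n)
print(h, g)
```

Exact fractions avoid any floating-point ambiguity when comparing nearly identical strategies.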

## IV Stopping Rule Design

### IV-A A Dynamic Programming Solution

For a sufficiently small depth $d$, total enumeration of the entire solution space is a possible approach for solving (20) to find an optimal deterministic stopping rule. However, the combinatorial structure of the problem means that evaluating the Bayes risks (19) and expected numbers of trials (18) for all possible strategies is computationally prohibitive, even for moderate trial budgets; this precludes full enumeration.

Conversely, one could start at the leaf nodes of a complete trellis (with depth $d$) and traverse the trellis towards its root, deciding whether each visited node merits inclusion in the optimized solution. This is the basis of a dynamic programming (DP) solution: it solves our optimization problem, which involves making a sequence of decisions, by determining, for each decision, subproblems that can be solved in a similar fashion [33]. As such, a solution of the original problem can be found from solutions of subproblems.

Precisely, we first relax (20) by writing its Lagrangian formulation:

$$\min_{T \in 2^{\mathcal{T}_d}}\ g_{\alpha,\beta}(T) + \lambda\, h_{\alpha,\beta}(T), \qquad (21)$$

where $\lambda \ge 0$ can be viewed as the desired MSE reduction per additional trial. We introduce three compact notations associated with node $v = (k,m)$:

$$R^{\alpha,\beta}_{k,m} = \sigma^2_{\alpha+k,\,\beta+m-k} = \frac{(\alpha+k)(\beta+m-k)}{(\alpha+\beta+m)^2(\alpha+\beta+m+1)} \qquad (22a)$$

is the mean Bayes risk conditioned on stopping at $v$,

$$S^{\alpha,\beta}_{k,m} = \mu_{\alpha+k,\,\beta+m-k} = \frac{\alpha+k}{\alpha+\beta+m} \qquad (22b)$$

is the probability of the next trial being a success, and

$$F^{\alpha,\beta}_{k,m} = 1 - \mu_{\alpha+k,\,\beta+m-k} = \frac{\beta+m-k}{\alpha+\beta+m} \qquad (22c)$$

is the probability of the next trial being a failure. The dynamic program summarized in Algorithm 1 iteratively constructs a solution to (21) by comparing $R^{\alpha,\beta}_{k,m}$ to the lowest cost achievable from the state that results after a single trial. If $R^{\alpha,\beta}_{k,m}$ is lower, then an additional trial is not warranted and the node is eliminated, i.e., its continuation probability is set to zero. Because of the decomposability of the problem, the solutions are optimal.
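A compact backward pass captures this comparison. The sketch below is our own rendering of the idea behind Algorithm 1 (the paper's pseudocode is not reproduced here); it returns the set of nodes at which continuation is optimal for a given $\lambda$:

```python
def dp_stopping_rule(d, lam, alpha=1.0, beta=1.0):
    """Backward pass for the Lagrangian (21): at node (k, m), continue only if
    lam (the cost of one trial) plus the expected optimal cost after that
    trial is below the stopping risk (22a)."""
    def risk(k, m):
        a, b = alpha + k, beta + m - k
        return a * b / ((a + b) ** 2 * (a + b + 1))

    J = {(k, d): risk(k, d) for k in range(d + 1)}   # forced stop at depth d
    continue_at = set()
    for m in range(d - 1, -1, -1):
        for k in range(m + 1):
            p_success = (alpha + k) / (alpha + beta + m)   # (22b)
            cont = lam + p_success * J[(k + 1, m + 1)] \
                       + (1 - p_success) * J[(k, m + 1)]
            stop = risk(k, m)
            if cont < stop:
                J[(k, m)] = cont
                continue_at.add((k, m))
            else:
                J[(k, m)] = stop
    return continue_at

rules = dp_stopping_rule(d=30, lam=1e-3)
print((0, 0) in rules)   # a cheap trial is worthwhile at the root
```

Sweeping $\lambda$ traces out solutions along the lower convex hull of the $(h, g)$ trade-off; a very large $\lambda$ makes every trial unattractive and yields the empty continuation set.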

### IV-B A Greedy Algorithm

The DP method (Algorithm 1) prunes nodes from the complete trellis $\mathcal{T}_d$. Monotonicity of the objective and cost can be exploited to develop a lower-complexity greedy algorithm that instead builds a trellis starting from just the root node.

The scheme outlined in Algorithm 2 monotonically improves the objective function value for the minimization problem (20) with each iteration. Specifically, at iteration $i$, the greedy decision is to add to the current trellis $T_i$ a node $v$ that yields the largest reduction in the Bayes risk per additional trial,

$$\frac{g_{\alpha,\beta}(T_i \cup v) - g_{\alpha,\beta}(T_i)}{h_{\alpha,\beta}(T_i \cup v) - h_{\alpha,\beta}(T_i)},$$

without violating the mean number of trials constraint. The scheme terminates when no such node exists.
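The growth step can be sketched directly from the definitions of (17)–(19). The code below is our own illustration of the Algorithm 2 idea (helper names ours, and the depth cap `d` is an implementation convenience, not from the paper); at each iteration it evaluates the risk reduction per added expected trial for every reachable candidate:

```python
def reach_probs(T, alpha, beta):
    """Reach probabilities (17) when the continuation set is exactly T."""
    u = {(0, 0): 1.0}
    depth = 1 + max((m for (_, m) in T), default=0)
    for m in range(depth):
        for k in range(m + 1):
            if (k, m) in T and u.get((k, m), 0.0) > 0.0:
                ps = (alpha + k) / (alpha + beta + m)
                u[(k + 1, m + 1)] = u.get((k + 1, m + 1), 0.0) + u[(k, m)] * ps
                u[(k, m + 1)] = u.get((k, m + 1), 0.0) + u[(k, m)] * (1.0 - ps)
    return u

def h_and_g(T, alpha, beta):
    """Expected trials (18) and expected Bayes risk (19) of strategy T."""
    h = g = 0.0
    for (k, m), uv in reach_probs(T, alpha, beta).items():
        if (k, m) not in T:   # reachable stopping node
            a, b = alpha + k, beta + m - k
            h += uv * m
            g += uv * a * b / ((a + b) ** 2 * (a + b + 1))
    return h, g

def greedy_trellis(eta, d=12, alpha=1.0, beta=1.0):
    """Algorithm-2-style greedy growth on a depth-d trellis: repeatedly add
    the reachable node giving the largest Bayes-risk reduction per added
    expected trial, subject to the mean-trials budget eta."""
    T = set()
    while True:
        h, g = h_and_g(T, alpha, beta)
        u = reach_probs(T, alpha, beta)
        best, best_ratio = None, 0.0
        for v, uv in u.items():
            if v in T or uv == 0.0 or v[1] >= d:
                continue
            hv, gv = h_and_g(T | {v}, alpha, beta)
            if hv <= eta and hv > h and (g - gv) / (hv - h) > best_ratio:
                best, best_ratio = v, (g - gv) / (hv - h)
        if best is None:
            return T, h, g
        T.add(best)

T, h, g = greedy_trellis(eta=4.0)
print(len(T), h, g)
```

Because adding a node $v$ increases the expected number of trials by roughly its reach probability while reducing the risk by that probability times the local risk reduction, the greedy ranking closely tracks the per-node quantity that also drives the threshold rule of the next subsection.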

### IV-C Online Threshold-Based Termination

Our final method applies a simple rule for termination of trials, depending on the prior parameters and the position in the trellis. It implies a trellis design, but it does not require storage of a designed trellis.

Suppose a sequence of trials reaches a node corresponding to the posterior distribution $\mathrm{Beta}(a,b)$. Denote the mean Bayes risk without performing an additional trial by

$$R_{\mathrm{stop}}(a,b) = \sigma^2_{a,b}, \qquad (23)$$

using the variance given in (15). When one additional trial is performed, the posterior distribution is either $\mathrm{Beta}(a+1, b)$ if the outcome of the additional trial is a success, or $\mathrm{Beta}(a, b+1)$ if the outcome of the additional trial is a failure. Therefore, the mean Bayes risk resulting from continuing with one additional trial is

$$R_{\mathrm{cont}}(a,b) = \mathrm{E}\!\left[(1-P)\,\sigma^2_{a,b+1} + P\,\sigma^2_{a+1,b}\right] = \frac{ab}{(a+b)(a+b+1)^2}. \qquad (24)$$

The Bayes risk reduction from one additional trial is

$$\Delta R(a,b) = R_{\mathrm{stop}}(a,b) - R_{\mathrm{cont}}(a,b) = \frac{ab}{(a+b)^2(a+b+1)^2}. \qquad (25)$$

Recall that, starting from a $\mathrm{Beta}(\alpha, \beta)$ prior, upon reaching node $(k,m)$, the posterior is $\mathrm{Beta}(\alpha+k,\,\beta+m-k)$. The Bayes risk reduction from an additional trial,

$$\Delta R(k,m;\alpha,\beta) = \frac{(\alpha+k)(\beta+m-k)}{(\alpha+\beta+m)^2(\alpha+\beta+m+1)^2}, \qquad (26)$$

can be the basis of an online stopping rule. Let $\Delta_{\min}$ denote a specified threshold value for the reduction in Bayes risk that justifies an additional trial. Then stopping based on this threshold induces the probabilities of continuing at each node of the trellis given by

$$q_{k,m} = \begin{cases} 1, & \Delta R(k,m;\alpha,\beta) \ge \Delta_{\min}; \\ 0, & \Delta R(k,m;\alpha,\beta) < \Delta_{\min}. \end{cases} \qquad (27)$$

Fig. 6(a) shows values of $\Delta R(k,m;\alpha,\beta)$ for a uniform prior. The choice of threshold $\Delta_{\min}$ results in the trellis shown in Fig. 6(b).

Notice that for a fixed trellis depth $m$, the denominator of (26) is fixed, and the numerator of (26) is a product of factors with fixed sum equal to $\alpha + \beta + m$. Thus, from the arithmetic–geometric mean inequality, $\Delta R$ is largest where the posterior distribution is most symmetric. This is apparent in the example in Fig. 6(b); since we have started with a uniform prior, the center of each row represents a symmetric posterior, and additional observations are most merited near the center of each row. Starting with a highly asymmetric prior ($\alpha \gg \beta$ or $\alpha \ll \beta$), the same principle explains an asymmetry in the greedily optimized trellis of continuation probabilities.
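The online rule requires only the running posterior parameters, so it needs no precomputed trellis at all. A minimal sketch (our own, with hypothetical function names) implements (25)–(27) and illustrates that more trials are implicitly allocated when $p$ is near $1/2$:

```python
import random

def delta_r(a, b):
    """Bayes risk reduction (25) from one more trial at a Beta(a, b) posterior."""
    return a * b / ((a + b) ** 2 * (a + b + 1) ** 2)

def threshold_rule(p, delta_min, rng, alpha=1.0, beta=1.0):
    """Online rule (27): keep observing Bernoulli(p) trials while the risk
    reduction of the next trial is at least delta_min; return the MMSE
    (posterior-mean) estimate and the number of trials used."""
    a, b = alpha, beta
    trials = 0
    while delta_r(a, b) >= delta_min:
        trials += 1
        if rng.random() < p:
            a += 1
        else:
            b += 1
    return a / (a + b), trials

rng = random.Random(3)
est, n_used = threshold_rule(0.1, delta_min=1e-5, rng=rng)

avg_half = sum(threshold_rule(0.5, 1e-5, rng)[1] for _ in range(100)) / 100
avg_small = sum(threshold_rule(0.02, 1e-5, rng)[1] for _ in range(100)) / 100
print(est, n_used, avg_half, avg_small)
```

The only state carried between trials is the pair $(a, b)$, which is what makes the rule attractive for per-pixel hardware implementation.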

###### Example 4 (Suboptimality of binomial sampling)

Suppose we have a $\mathrm{Beta}(1,1)$ (i.e., uniform) prior. Then (26) simplifies to

$$\Delta R(k,m;1,1) = \frac{(k+1)(m-k+1)}{(m+2)^2(m+3)^2}.$$

For the threshold-based termination to induce deterministic stopping after a fixed number of trials, the incremental benefit at the least favorable node of depth $m^*$ (an edge node) must be at least as large as at the most favorable node of depth $m^*+1$ (a central node):

$$\frac{m^*+1}{(m^*+2)^2(m^*+3)^2} \ \ge\ \frac{\left(\left\lfloor \tfrac{1}{2}(m^*+1) \right\rfloor + 1\right)\left(m^* - \left\lfloor \tfrac{1}{2}(m^*+1) \right\rfloor + 2\right)}{(m^*+3)^2(m^*+4)^2}. \qquad (28)$$

Since (28) fails to hold for any $m^* \ge 3$, threshold-based termination induces binomial sampling only for the smallest numbers of trials. This is consistent with Fig. 6. For such small trial budgets, full enumeration of stopping rules is also feasible, and one can conclude that binomial sampling is indeed suboptimal for any trial budget greater than 2. Similar arguments can be made for non-uniform beta priors.
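Under our reading of (28) – the least favorable (edge) node of depth $m^*$ versus the most favorable (central) node of depth $m^*+1$ – a direct check (our own script) shows the inequality holds only for the first few depths:

```python
def lhs(m):
    """Left side of (28): edge node of depth m under a uniform prior."""
    return (m + 1) / ((m + 2) ** 2 * (m + 3) ** 2)

def rhs(m):
    """Right side of (28): central node of depth m + 1 under a uniform prior."""
    j = (m + 1) // 2   # floor((m* + 1) / 2)
    return (j + 1) * (m - j + 2) / ((m + 3) ** 2 * (m + 4) ** 2)

holds = [m for m in range(20) if lhs(m) >= rhs(m)]
print(holds)
```

Asymptotically the left side decays like $m^{-3}$ while the right side decays like $m^{-2}$, so the inequality must eventually fail, matching the conclusion that binomial sampling is threshold-induced only at tiny budgets.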

### IV-D Comparisons of Designs

Sweeping $\Delta_{\min}$ in threshold-based termination is very similar to sweeping $\lambda$ in Algorithm 1; it will achieve certain mean numbers of trials, similar to sweeping the budget $\eta$ in Algorithm 2. Intermediate values of the mean number of trials can be achieved by randomization: finding the boundary nodes at which the rules differ and varying their continuation probabilities over $[0,1]$. This idea is used to enforce an equal expected number of trials for trellises optimized with each method, thus allowing a fair comparison of their Bayes risks. For a fixed mean number of trials, Algorithms 1 and 2 were found to give exactly the same trellis, while online threshold-based termination gave a slightly different trellis with slightly higher mean Bayes risk. Fig. 7 illustrates the difference in $q_{k,m}$ values. It is zero for the vast majority of nodes, with 24 nodes at which the DP-designed trellis terminates but the threshold-based rule does not (red), and 54 nodes at which the threshold-based rule terminates but the DP-designed trellis does not (blue). (The mean numbers of trials are equal. To be convinced that the blue and red nodes can balance, note that while there are more blue nodes, they are at larger values of $m$ and thus have lower probabilities of being reached.)

Illustrated in Fig. 8 is a comparison of our three proposed implementable strategies, applied for a uniform prior, over a range of trial budgets. MSEs of DP (Algorithm 1) and the greedily optimized trellis (Algorithm 2) coincide for all trial budgets because the trellises are identical – though we have not proven that this is guaranteed. The online threshold-based stopping rule is only very slightly worse by a factor of at most 1.000195 (less than 0.001 dB).

The phenomenon of more trials being merited when p is near 1/2 counteracts the MSE being largest for p near 1/2. This is illustrated in Fig. 9(a), which shows mean numbers of trials allocated as a function of p. We have optimized for MSE averaged over the prior and, in so doing, obtained a modest improvement in this average when comparing online threshold-based termination to conventional binomial sampling. A more significant reduction in the worst-case MSE is a by-product of the optimization (see Fig. 9(b)).

### IV-E Asymptotic Comparison with Oracle-Aided Allocation

Considering the non-degenerate cases p ∈ (0, 1), the threshold-based termination is asymptotically equivalent to oracle-aided optimal allocation. For a large trial budget, we will find an approximation for m∘, the number of trials at which the online threshold-based rule terminates. This will match the form of (2) or (6).

Using (26), for an initial Beta(α, β) prior, the online rule terminates at node (k, m) if and only if

  \frac{(\alpha+k)(\beta+m-k)}{(\alpha+\beta+m)^2(\alpha+\beta+m+1)^2} \le \Delta_{\min}. (29)

Since the trial budget is large and p ∈ (0, 1), the quantities m, k, and m − k are all large when nearing termination. Hence, we approximate the expression in (29) as

  \frac{(\alpha+k)(\beta+m-k)}{(\alpha+\beta+m)^2(\alpha+\beta+m+1)^2} = \frac{m^2(\alpha/m+k/m)(\beta/m+1-k/m)}{m^4(\alpha/m+\beta/m+1)^2(\alpha/m+\beta/m+1+1/m)^2} \approx \frac{(k/m)(1-k/m)}{m^2} = \frac{\hat{p}_{\mathrm{ML}}(1-\hat{p}_{\mathrm{ML}})}{m^2}, (30)

where p̂_ML = k/m is the ML estimate of p.

Substituting (30) into (29), we obtain

  m^\circ \approx \sqrt{\frac{\hat{p}_{\mathrm{ML}}(1-\hat{p}_{\mathrm{ML}})}{\Delta_{\min}}}. (31)

By the law of large numbers, p̂_ML converges to p, so (31) shows a match to (2), with Δ_min determining the trial budget. Furthermore, by comparison with (6), we see an equivalence under an appropriate choice of Δ_min.

Fig. 9(a) illustrates an example of the approximate match between threshold-based termination and oracle-aided sampling that is predicted by the match among (2), (6), and (31). Note that convergence is not uniform in p; a larger trial budget is needed to observe approximate equivalence in allocations for p near 0 and near 1.
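The asymptotic prediction (31) is straightforward to check in simulation. The sketch below is our own minimal implementation under a uniform prior, with illustrative values of p and Δ_min (not taken from the paper); it runs the online threshold rule on one Bernoulli realization and compares the stopping time with the prediction of (31).

```python
import math
import random

def delta_R(k, m, alpha=1.0, beta=1.0):
    """One-trial Bayes risk reduction (26) at node (k, m) for a Beta prior."""
    s = alpha + beta + m
    return (alpha + k) * (beta + m - k) / (s**2 * (s + 1)**2)

def stopping_time(p, delta_min, rng):
    """Take Bernoulli(p) trials until the reduction drops to delta_min."""
    k = m = 0
    while delta_R(k, m) > delta_min:
        k += rng.random() < p   # Bernoulli trial; bool adds as 0/1
        m += 1
    return m

rng = random.Random(0)
p, delta_min = 0.5, 1e-8                       # illustrative values
m_stop = stopping_time(p, delta_min, rng)
m_pred = math.sqrt(p * (1 - p) / delta_min)    # approximation (31), p-hat -> p
```

For these values the prediction is about 5000 trials, and the simulated stopping time lands within a few percent of it.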

Fig. 10 shows the variation of the MSEs with the constraint on the mean number of trials for conventional binomial sampling, threshold-based termination, and oracle-aided allocation. The results are based on Monte Carlo simulations, in MATLAB, using the phantom image in Fig. 3(a). As expected, the optimized rules consistently achieve MSE improvements over conventional binomial sampling for all simulated trial budgets. In addition, when compared to the unrealizable oracle-aided method, the threshold-based approach only marginally underperforms at moderate constraints on the mean number of trials. This observation is further underscored in Fig. 9, which shows significant overlap between threshold-based termination and oracle-aided allocation, in terms of both trial allocations and the resulting MSEs.

In line with the earlier asymptotic analysis, the threshold-based and oracle-aided performances coincide for moderate to high mean numbers of trials, independent of the prior. Under a highly skewed prior consistent with the true distribution of the phantom pixels, Fig. 10(b) demonstrates that online threshold-based termination can even outperform the oracle-aided binomial method at low trial budgets. This phenomenon is attributable to the online method allocating more trials when the Bernoulli process realization has a relatively high fraction of successes. Put simply, it allocates more trials for “unlucky” realizations where the MSE would be higher, while the oracle-aided binomial method maintains a fixed number of trials.

## V Estimating Functions of a Bernoulli Parameter

When estimating an arbitrary function f(p) of a Bernoulli parameter is of interest [11], one can derive stopping strategies similar to those above. In this section, we concern ourselves only with the estimation of f(p) = log p due to its prevalence in real-life scenarios. For instance, the subjective brightness perceived by the human visual system is a logarithmic function of the incident light intensity [34, 35]. Also, the common log odds ratio, log(p/(1 − p)), is approximately equal to log p when p ≪ 1.
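As a quick numeric illustration of the last point, with an arbitrarily chosen small p:

```python
import math

p = 0.01                                          # an illustrative small Bernoulli parameter
log_odds = math.log(p / (1 - p))
log_p = math.log(p)
rel_err = abs(log_odds - log_p) / abs(log_odds)   # about 0.2% here
```

The relative gap between log odds and log p shrinks further as p decreases.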

As before, we begin with a squared error loss

  L(p,\cdot) = (f(p) - \hat{f}(\cdot))^2, (32)

where f̂(·) is the estimate of f(p). The expectation of this loss function over p gives the Bayes risk

  R(\cdot) = E[L(p,\cdot)] = E[(f(p) - \hat{f}(\cdot))^2]. (33)

For f(p) = log p, suppose a sequence of trials leads to a node in the trellis corresponding to the posterior distribution Beta(a, b). Under this posterior, the MMSE estimator of log p is [36]

  \hat{f}(a,b) := E[\log P] = \psi^{(0)}(a) - \psi^{(0)}(a+b), (34)

where ψ^(n) is the polygamma function of order n. The Bayes risk (33) when no additional trial is performed, R_stop(a, b), becomes the variance of log P [37]:

  R_{\mathrm{stop}}(a,b) = \psi^{(1)}(a) - \psi^{(1)}(a+b). (35)

If one additional trial is performed, the Bayes risk reduces to

  R_{\mathrm{cont}}(a,b) = E[(1-P)R_{\mathrm{stop}}(a,b+1) + P\,R_{\mathrm{stop}}(a+1,b)] = \frac{b}{a+b}R_{\mathrm{stop}}(a,b+1) + \frac{a}{a+b}R_{\mathrm{stop}}(a+1,b). (36)

Hence, the Bayes risk reduction from one additional trial is

  R_{\mathrm{stop}}(a,b) - R_{\mathrm{cont}}(a,b) = \frac{b}{a(a+b)^2}. (37)

Starting with prior Beta(α, β), the counterpart to (26) for estimation of log p is

  \Delta R(k,m;\alpha,\beta) = \frac{\beta+m-k}{(\alpha+k)(\alpha+\beta+m)^2}. (38)
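The chain from (35) to (38) can be checked numerically. The sketch below uses a truncated-series trigamma (our pure-Python stand-in for ψ^(1), with an integral tail correction) to confirm that the one-trial risk reduction matches the closed form b/(a(a + b)²) for a few posteriors.

```python
def trigamma(x, n_terms=100_000):
    """Series psi^(1)(x) = sum_{n>=0} 1/(x+n)^2, truncated with a tail correction."""
    s = sum(1.0 / (x + n)**2 for n in range(n_terms))
    return s + 1.0 / (x + n_terms) + 1.0 / (2 * (x + n_terms)**2)

def R_stop(a, b):
    """Variance of log P under a Beta(a, b) posterior, Eq. (35)."""
    return trigamma(a) - trigamma(a + b)

def R_cont(a, b):
    """Bayes risk after one more trial, Eq. (36)."""
    return b / (a + b) * R_stop(a, b + 1) + a / (a + b) * R_stop(a + 1, b)

# Discrepancy between the simulated reduction and the closed form b/(a(a+b)^2).
errs = [abs((R_stop(a, b) - R_cont(a, b)) - b / (a * (a + b)**2))
        for a, b in [(1, 1), (2, 5), (3.5, 0.5)]]
```

For instance, at a = b = 1 both sides equal 1/4, and the discrepancies stay at the level of floating-point noise.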

As before, this can be used in (27) as an online threshold-based termination method.

The Bayes risk reductions for both f(p) = p in (26) and f(p) = log p in (38), starting with a uniform prior, are shown as heat maps in Fig. 11. When f(p) = log p, the reduction from additional trials after observing sequences with a low number of successes is significantly larger. Thus, the online threshold-based termination of Section IV-C is likely to assign more trials for the smaller underlying Bernoulli parameters. Such a stopping rule is intuitive because a fixed amount of estimation error for p would contribute more to the loss function defined in (32), with f(p) = log p, when p is small. In fact, one can choose a loss function that enforces different penalties for different values of p. An example is the family of weighted mean squared errors, E[w(p)(p − p̂)²], where the weighting function w(p) is designed according to the problem. A special case is the relative MSE, with w(p) = 1/p², which is approximately the squared loss in (33) with f(p) = log p for estimates sufficiently close to the true value [11].
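The contrast between the two heat maps can be seen concretely in a single trellis row. This small sketch (uniform prior, with depth m = 20 chosen arbitrarily) locates the node with the largest one-trial reduction for each objective:

```python
m = 20   # an arbitrary trellis depth; uniform Beta(1, 1) prior

def dR_p(k):
    """Eq. (26) at depth m: reduction for estimating p."""
    return (1 + k) * (1 + m - k) / ((2 + m)**2 * (3 + m)**2)

def dR_log(k):
    """Eq. (38) at depth m: reduction for estimating log p."""
    return (1 + m - k) / ((1 + k) * (2 + m)**2)

best_p = max(range(m + 1), key=dR_p)      # largest reduction at the central node
best_log = max(range(m + 1), key=dR_log)  # largest reduction at the all-failures node
```

For f(p) = p the reduction peaks at the symmetric center k = m/2, whereas for f(p) = log p it peaks at k = 0, so additional trials flow toward realizations suggesting a small p.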

When smaller values of p are of more importance, it makes sense to use a strategy that allocates more trials to those instances. Negative binomial sampling, explained in Section III-B, achieves this type of trial allocation. For the estimation of log p, we compare the performances of binomial sampling, negative binomial sampling, and online threshold-based termination in Fig. 12. Threshold-based termination outperforms both binomial and negative binomial sampling.

## VI Applications to Active Imaging

Active imaging systems typically raster scan the scene, probing each patch (i, j) using pulsed illumination. The measured data, used to form an image of the scene, are the arrays of counts k_{i,j} and m_{i,j}; i.e., the number of detections (successes) and the number of illumination pulses (trials) for each scene patch. Note that the conventional approach of a fixed number of trials makes m_{i,j} constant over (i, j) and k_{i,j} random, whereas both k_{i,j} and m_{i,j} are random when the proposed approach is applied.

The parameters of the Bernoulli processes generated by probing a scene patch and its neighbors are typically correlated. This can be exploited in the image formation stage through mechanisms inspired by any of various image compression or denoising methods. For this initial demonstration of adaptive acquisition, we apply total variation (TV) regularization [38]. We present simulation results using the Shepp–Logan phantom in Fig. 3(a). Recall that this is scaled so that its pixel values are valid Bernoulli parameters.

### VI-A Estimation of f(p) = p

We focus here on comparing conventional binomial sampling against online threshold-based termination (27) applied for each pixel. For f(p) = p, the relevant Bayes risk reduction per trial is given by (26).

#### VI-A1 MMSE Estimation Under I.I.D. Prior

When not exploiting any spatial correlations, the estimation for each pixel is performed separately using the methods of Section IV-C. Under a Beta(α, β) prior, the MMSE estimate is p̂ = (α + k)/(α + β + m).
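A minimal per-pixel sketch combining the threshold rule (27) with this posterior-mean estimate might look as follows; the function name, prior, threshold, and ground-truth pixel value are our illustrative assumptions, not the paper's implementation.

```python
import random

def acquire_pixel(p_true, alpha, beta, delta_min, rng):
    """Sketch of per-pixel adaptive acquisition: threshold-based stopping (27)
    followed by the Beta posterior mean.  All parameter values are illustrative."""
    k = m = 0
    while True:
        s = alpha + beta + m
        # Stop when the one-trial Bayes risk reduction (26) drops to delta_min.
        if (alpha + k) * (beta + m - k) / (s**2 * (s + 1)**2) <= delta_min:
            break
        k += rng.random() < p_true   # one Bernoulli trial
        m += 1
    p_hat = (alpha + k) / (alpha + beta + m)   # MMSE estimate under the prior
    return p_hat, m

rng = random.Random(1)
p_hat, m_used = acquire_pixel(0.2, 1.0, 1.0, 1e-4, rng)
```

Looping this over all pixels yields the arrays of counts and trials described in Section VI, with the number of trials adapted pixel by pixel.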

Fig. 13 shows that, with a suitable choice of the Beta prior, a substantial MSE improvement is attained for the given trial budget. For the same choice of prior, we also perform independent experiments for two different trial budgets as indicated in Table I, which shows similar improvement factors. These performance gains are close to the prediction from the trial allocation gain (Example 2(d)). Furthermore, we show in Fig. 14 that significant MSE improvements can be attained for a large range of Beta priors, and we also observe that a mismatched prior degrades the performance of binomial sampling, whereas threshold-based termination seems to be robust to such choices.

#### VI-A2 TV-Regularized ML Estimation

Reconstruction quality can be improved through the use of TV-regularized ML estimation [38, 1]. In one typical experimental trial, shown in Fig. 15, the TV-regularized reconstruction from data obtained with online threshold-based termination outperforms conventional binomial sampling in MSE; the trial budget and prior are the same as used previously. Furthermore, Table I also includes results averaged over 100 experiments for the same trial budgets as before. As can be seen, imposing TV regularization increases the performance gain from the data-adaptive stopping rule.