# Practical optimal experiment design with probabilistic programs

###### Abstract

Scientists often run experiments to distinguish competing theories. This requires patience, rigor, and ingenuity—there is often a large space of possible experiments one could run. But we need not comb this space by hand—if we represent our theories as formal models and explicitly declare the space of experiments, we can automate the search for good experiments, looking for those with high expected information gain. Here, we present a general and principled approach to experiment design based on probabilistic programming languages (PPLs). PPLs offer a clean separation between declaring problems and solving them, which means that the scientist can automate experiment design by simply declaring her model and experiment spaces in the PPL without having to worry about the details of calculating information gain. We demonstrate our system in two case studies drawn from cognitive psychology, where we use it to design optimal experiments in the domains of sequence prediction and categorization. We find strong empirical validation that our automatically designed experiments were indeed optimal. We conclude by discussing a number of interesting questions for future research.

Practical optimal experiment design with probabilistic programs

Long Ouyang*, Michael Henry Tessler*, Daniel Ly*, Noah D. Goodman Department of Psychology Stanford University Stanford, CA 94305 {louyang, mtessler, lydaniel, ngoodman}@stanford.edu

noticebox[b]29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\end@float

## 1 Introduction

Designing scientific experiments to test competing theories is hard. To distinguish theories, we would like to run experiments for which the theories make different predictions, but there are often many possible experiments one could run. Formalizing theories as mathematical models can help. Models make explicit hypotheses about observed data and thus make it easier (or in some cases, possible) to explore the implications of a set of theoretical ideas. However, exploring models can be time-consuming and searching for experiments where the models sufficiently diverge is still largely driven by the scientist’s intuition. This intuition may be biased in a number of ways, such as towards experiments that show qualitative differences between models even when more informative quantitative differences may exist.

When the space of models and of experiments have been made explicit, it is possible to use optimal experiment design (OED) to automate search; OED searches for experiments that maximally update our beliefs about a scientific question. The information-theoretic foundations of OED are fairly straightforward [1], but it has not enjoyed widespread use in practice. Some OED systems are too narrow to be of general use and the more general systems require too much conceptual and implementational know-how to be widely adopted (e.g., users must supply their own objective function and derive a solution algorithm for it). In order for OED to be both of general and practical use, the computation for experiment selection must be automatic. This is only possible with a common formalism for specifying hypotheses and experiments. Probabilistic programming languages (PPLs) are such a formalism; they are high-level and universal languages for expressing probabilistic models. In this work we describe a system in which the user expresses the competing hypotheses and possible experiments in a PPL; the optimal experiment is then computed with no further input from the user.

We first describe our framework in general terms and then apply it in two case studies from cognitive psychology. Psychology is a good target domain for OED: hypotheses can often be expressed as mathematical models, rapid experiment iteration is possible and beneficial, and there is a large community ready to use (but not necessarily develop) sophisticated tools. Psychological experiments also have certain challenging features: human participants give noisy responses, experimental results are sensitive to the size of one’s sample, and computational models often do not make direct predictions about experimental data, instead requiring linking functions to convert model output into predictions about data. Our system naturally addresses these concerns. In the first case study, we consider the problem of distinguishing three toy models of human sequence prediction. In the second case study, we go beyond toy models and analyze a classic paper on human category learning that compared two models using an experiment designed by hand. We find that OED discovers experiments that are several times more effective than the original in an information-theoretic sense. Our work opens a number of rich areas for future development, which we explore in the discussion.

## 2 Experiment design framework

We begin with a concrete example before giving formal details.
Imagine that we are studying how people predict values for sequence data (e.g., flips of a possibly-trick coin).
We want to compare two models: , in which people believe the coin is unbiased, and , in which people believe the coin has a bias that is unknown (expressed as a uniform prior on the unit interval) but that can be learned from data.
We have a uniform prior on the models and we wish to update this belief distribution by conducting an experiment where we show people four flips of the same coin and ask them to predict what will happen on the next flip.
There are 16 possible experiments (all combinations of H and T for 4 flips) and possible outcomes—predictions of heads or tails for each of human participants.
Each model is a probability distribution on conditional on the experiment —it describes a prediction about how people would respond after seeing some particular sequence of flips.
For convenience, we write our models in terms of a what a single person would do and assume that all people respond according to the same model, i.e., participant responses are i.i.d.^{1}^{1}1We use this simple linking function throughout this paper but our approach handles arbitrary linking functions (e.g., hierarchical models with subject-wise parameters).

How informative would running the experiment HHTT be? and are identical here (heads and tails are equally likely)—if we ran our experiment with a single participant, neither experimental result would update our beliefs about the models, so this is a poor experiment. By contrast, the experiment HHHH would be much more informative. Under , but under , . In this case, either experimental response would be informative. If the participant predicted heads, this would favor and if she predicted tails, this would favor . Thus, HHHH would be a good experiment to run to disambiguate these models. The goal of OED is to automate this reasoning.

We now formalize our framework. We wish to compare a set of models in terms of how well they account for empirical phenomena. A model is a conditional distribution representing the likelihood of empirical results for different possible experiments . We begin with a prior and aim to conduct an experiment that maximally updates this distribution, providing as much information as possible. That is, we wish to maximize . A priori, we do not know what the result of any particular experiment will be, so we must marginalize over the possible results :

(1) |

where is the probability of observing result for experiment . If we have reason to believe that contains the true model of the data, then a suitable choice for is the predictive distribution implied by the models . If, however, we think may not contain the true model, then an uninformative prior may be more appropriate.

### 2.1 Writing models as probabilistic programs

A key requirement for automating experiment design is to write hypotheses about the data as explicit models—in our case, probabilistic programs. We use the probabilistic programming language WebPPL (webppl.org), a small but feature-rich probabilistic programming language embedded in Javascript [2]. WebPPL supplies a number of primitive distributions (e.g. Binomial), which support generating samples and calculating the probability density of values from the domain. For instance, we can sample from using sample(Binomial({n: 4, p: 0.5})) and we can determine the log-probability of the value 2 under this distribution using score(Binomial({n: 4, p: 0.5}), 2). Often, we are interested in posterior inference. Let’s say we are interested in the distribution conditional on at least 2 successes. We write this as:

g is a function representing the conditional distribution.
Conceptually, it draws a sample x from the prior, rejects values less than 2 using condition (which enforces hard constraints^{2}^{2}2 An alternate form called factor generalizes condition, continuously weighting different program execution paths rather than simply accepting or rejecting them.) and returns x.
However, g is not directly runnable.
To reify the conditional distribution, we must perform marginal inference on this model using Infer(g, options), which yields a distribution object (a probability table).
This provides a useful separation—we distinguish what we wish to compute from how we try to compute it.
In the above snippet, and throughout, we omit the options object, which describes to Infer which inference algorithm to use. WebPPL currently provides several inference algorithms: MCMC (MH, HMC), SMC, enumeration for discrete models, and variational inference.

### 2.2 Writing OED as a probabilistic program

Surprisingly, after expressing the spaces of models, experiments, and responses as probabilistic programs, it is straightforward to express OED as a probabilistic program as well (see Listing 1). Equation 1 translates to around 20 lines of WebPPL code, expressing that OED is an inference problem. A rich language like WebPPL is particularly well-suited here, as we lean heavily on the ability to perform nested inference. Also, writing OED as a probabilistic program gives us access to algorithms that are more sophisticated than previous research has considered (e.g., mixtures of enumeration and HMC for experiment spaces that have continuous and discrete subspaces). Finally, note that we implement search for the optimal experiment using inference. This is not essential—we could also replace the outermost Infer() call with an optimization procedure (e.g., Search()).

Our OED code is available as a WebPPL package at https://censored. We next illustrate our system by applying it to distinguish psychological theories of sequence prediction.

## 3 Case study 1: Sequence prediction

Human judgments about sequences are surprisingly systematic and nonuniform across equally likely outcomes – for example, we might strongly believe the next coin flip in the sequence HHTTHHTT will be H, whereas we might be unsure for the sequence THHTHTHT. There are many hypotheses one might have about what underlies human intuitions about such sequences [3, 4, 5]. Here, we consider three simple models of people’s beliefs: (a) Fair coin: people assume the coin is fair, (b) Bias coin: people believe the coin has some unknown bias (i.e., the probability of a H outcome) that they can learn from data, (c) Markov coin: people believe the coin has some probability of transitioning between spans of H and T outcomes, also learnable from the data. As in our earlier example, we consider an experimental setup where participants see four flips of the same coin and must predict the next flip.

### 3.1 Formalization

The model space is .
For now, we assume that the experiment will include data from a single participant, so the experiment space is the Cartesian product representing the fixed sample size of 1 and sequence space.^{3}^{3}3Our notion of “experiment” is quite general, including traditional components like stimulus properties (e.g., coin sequence) as well as other components like dependent measure and sample size.
Finally, is the response set .

In , we model participants as believing that the coin has an equal probability of coming up heads or tails:

Here, flip(0.5) is shorthand for sample(Bernoulli({p:0.5})). Note the type signature of this model—it takes as input an experiment and returns a distribution on possible results of that experiment.

In , people assume the coin has some unknown bias, learn it from observations, and use it to predict the next coin flip:

biasCoin first samples a coin weight w from a uniform prior, uses it to sample a sequence of flips, and conditions on this matching the observed sequence seq. Thus, it learns the likely coin weights to have generated the observed sequence, and makes a prediction about the next flip.

Finally, (see supplement for code) assumes that the flips are generated by a Markov process where the first coin flip is uniform and subsequent flips have some probability p of transitioning away from the previous value. p is learned from the data and used to predict the next flip.

### 3.2 Predictions of optimal experiment design

Using an uninformative prior for , we ran OED for three different model comparison: fair–bias, bias–Markov, and fair–bias–Markov and planned to collect data from 20 participants (rather than 1).

In the example call to OED (Fig. 1a), we first lift each single-participant model into a model of group responses using an i.i.d. linking function groupify (see supplement). The experiment space xSample includes a fixed number of participants and the unique sequences. The response space ySample is an uninformative prior over the number of H responses.

Consider the fair–bias comparison (Fig. 1b, left). Observe that several experiments that have 0 information gain (e.g., HTHT). The models make exactly the same predictions in this case (albeit for different reasons), so the experiment has no distinguishing power. The best experiments are HHHH and TTTT. This is intuitive—the bias model would infer a strongly biased coin and make a strong prediction, while the fair coin model is unaffected by the observed sequence.

Moving to the bias–Markov comparison (Fig. 1b, middle), the best and worst experiments actually reverse. Now, HHHH and TTTT are the least informative (because, as before, the models make similar predictions here), whereas HTHT and THTH are the most informative. This makes sense—the bias model learns a 0.5 probability of heads and so assigns equal probability of heads and tails to the next flip, whereas the Markov model learns that the transition probability is high and assigns high probability to the opposite of whatever outcome was observed last (T for THTH and H for HTHT).

(a) | (b) |
---|---|

\includegraphics[width=0.6]img/coin_predictions.pdf | \includegraphics[width=0.4]img/coin_eig_3way_nsubj_wlegend.pdf |

In the full bair–bias–Markov comparison (Fig. 1b, right), the worst experiments (e.g., TTHH) are again cases where all models make similar predictions. The best experiments are TTTT and HHHH, a result that is non-obvious because we are comparing three models rather than two. The best experiment HHHH is very good at separating the fair model from the other two models, while still predicting a difference between bias and Markov (Fig. 2a, right). The second best experiment, HHHT, better distinguishes the bias model from the Markov model as it predicts a qualitative difference (Fig. 2a, middle), but this comes at the cost of less expected information gain overall. An automated design tool is especially useful in these settings, where human intuition would likely favor the qualitative over the quantitative difference.

Finally, expected information gain of an experiment varies as a function of sample size (Fig. 2b). This function is non-linear and, crucially, the rank ordering of experiments can change. For the the full model comparison, the experiments HTHT and HHHT switch places after 12 participants. This is particularly relevant when three models are being compared, as small quantitative differences between two models may amplify as the sample size grows. In our example here, the optimal experiment with 1 participant is the same as with 30 participants.

### 3.3 Empirical validation

We validated our system by collecting human judgements for all 16 experiments and comparing expected information gain with the actual information gain from the empirical results. We randomly assigned 351 participants to an experiment (all of the 16 experiments were completed by 20 unique participants). Participants pressed a key to sequentially reveal the sequence of 4 flips and then predicted the next coin flip (either heads or tails).

For each experiment and result , we computed the expected information gain from running our empirical sample of participants^{4}^{4}4N’s are uneven due to randomization. We use the empirical N’s for EIG in comparisons to AIG. and compared this to the actual information gain, , for the three model comparison scenarios.
Figure 3 shows that expected information gain is a reliable predictor of the empirical value of an experiment (minimum = 0.857). This indicates that the OED tool could be relied on to automatically choose good experiments for this case study.

## 4 Case study 2: Category learning

Here, we explore a more complex and realistic space of models and experiments. In particular, we analyze a classic paper on the psychology of categorization by Medin and Schaffer [6] that aimed to distinguish two competing models of category learning – the exemplar model and the prototype model. Using intuition, Medin and Schaffer (MS) designed an experiment (often referred to as the “5-4 experiment”) where the models made diverging predictions and found that the results supported the exemplar model. Subsequently, many other authors followed their lead, replicating and using this experiment to test other competing models. Here, we ask: how good was the MS 5-4 experiment? Could they have run an experiment that would have distinguished the two models with less data?

### 4.1 Models

Both the exemplar and prototype models are classifiers that map inputs (objects represented as a vector of Boolean features) to a probability distribution on the categorization response (a label: A or B). The exemplar model assumes people store information about every instance of the category they have observed; categorizing an object is thus a function of the object’s similarity to all of the examples of category A versus the similarity to all of B’s examples. By contrast, the prototype model assumes that people store a measure of central tendency for each category—a prototype. Categorization of an object is thus a function of its similarity to the A prototype versus its similarity to the B prototype. For details and representation of these models in WebPPL, see the supplement.

### 4.2 Experiments

Participants first learn about the category structures in a training phase where they perform supervised learning of a subset of the objects and are then tested on this learning in a test phase. During training, participants see a subset of the objects presented one at a time and must label each object. Initially, they can only guess at the labels, but they receive feedback so that they can eventually learn the category assignments. After reaching a learning criterion, they complete the test phase, where they label all the objects (training set and the held out test set) without feedback.

MS used visual stimuli that varied on 4 binary dimensions (color: red vs. green, shape: triangle vs. circle, size: small vs. large, and count: 1 vs. 2). For technical reasons, they considered only experiments that (1) have linearly separable decision boundaries, (2) contain 5 A’s and 4 B’s in the training set, and (3) have the modal A object 1111 and the modal B object 000. There are, up to permutation, 933 experiments that satisfy these constraints.

### 4.3 Predictions of optimal experimental design

Using the predictive prior for , we computed the expected information gain for all 933 experiments and found that the best experiment (for a single participant) had an expected information gain of 0.08 nats, whereas the MS 5-4 experiment had an expected information gain of only 0.03 nats. Thus, the optimal experiment is expected to be 2.5 times more informative than the MS experiment. Indeed, the MS experiment is near the bottom third of all experiments (Fig. 4a).

Why is the MS experiment ineffective? One reason is that Medin and Schaffer prioritized experiments that predict a qualitative categorization difference (i.e., when one model predicts that an object is an A while the other predicts it is a B). The experiment they designed indeed predicts a qualitative difference for one object but this difference has a small magnitude and comes at the expense of little information gain from the remaining objects. The optimal experiment is better able to quantitatively disambiguate the models by maximizing the information from all the objects simultaneously.

(a) | (b) |
---|---|

\includegraphics[width=2.5in]img/category-eig-dist.pdf | \includegraphics[width=2.5in]img/category-aig-curve.pdf |

### 4.4 Empirical validation

To validate our expected information gain calculations, we ran the MS 5-4 and the optimal experiment with 60 participants each. Figure 4b shows that the optimal experiment we found for a single participant is indeed better than the MS experiment (=1, blue greater than red). For =1, the mean actual information gain for the optimal experiment is 0.15, whereas it is 0.026 for the MS experiment. This 5-fold difference in informativity is even greater than the 2.5-fold difference predicted by expected information gain. In addition, by incrementally introducing more data, we observe that both experiments achieve maximal actual information gain but the optimal experiment takes only 10 participants to asymptote to this maximum whereas the MS experiment takes around 30. Thus, the optimal experiment provides the same amount of information for a third of the experimental cost.

## 5 Related work

The basic intuition behind OED—to find experiments that maximize some expected measure of informativeness—has been independently discovered in a number of fields, including physics [7], chemistry [8], biology [9, 10], psychology [11], statistics [1], and machine learning [12].

These papers, however, often implement OED for relatively limited cases, specializing to particular model classes and committing to a single inference technique. For example, in systems biology, Liepe et al. [10] devised a method for finding experiments that optimize information gain for parameters of biomolecular models (ODEs with Gaussian noise). Their information measure (Shannon entropy) is similar to ours, but they focus on a narrow family of models and commit to a bespoke inference technique (an ABC scheme based on SMC). In psychology, Myung & Pitt [11] devised a general design optimization method but this method requires researchers to select their own utility function for the value of an experiment and implement inference on their own. For example, they compared six memory retention models using Fisher Information Approximation as a utility function and performed inference using a custom annealed SMC algorithm. Such “bring-your-own” requirements impose a significant burden on practitioners and are a real barrier to entry.

By contrast, we show that OED can be expressed as a generic, concise, and flexible function in a probabilistic programming language, which allows practitioners to rapidly explore different spaces of models, experiments, and inference algorithms. Additionally, our work is the first to (1) demonstrate that expected information gain is a reliable predictor of actual information gain and to (2) characterize the cost benefits of OED.

## 6 Conclusion

Practitioners aim to design experiments that yield informative results. Our approach partially automates experiment design, searching for experiments that maximally update beliefs about the model distribution. With our approach, the scientist writes her hypotheses as probabilistic programs, sketches a space of possible experiments, and hands these to OED for experiment selection. We stress that our work complements practitioners; it does not replace them. Our tool eliminates the need to manually comb large spaces for good experiments; we hope this will free scientists and engineers to work on the more interesting problems—devising empirical paradigms and building models.

Our approach suggests a number of interesting directions for future work. We cast the OED problem as a problem of inference and this might suggest particular inference techniques. For instance, if a particular response is quite unlikely (i.e., is negligible), it may be acceptable to have a less precise estimate of information gain for that response. Additionally, while our framework can accommodate different costs of experiments using a prior distribution, the focus of our work is on finding informative experiments. It could be useful to integrate our system into multi-objective optimization systems for balancing multiple design considerations (e.g., informativeness, cost, ethics). Finally, we have explored the trajectory of information gain as the number of i.i.d. observations increases. Observations need not independent, however. Adaptive testing can be formulated as a problem of information gain of sequences of experiments, which produce dependent and non-identical responses. In this paper, we showed case studies of our method in cognitive psychology but we believe that it is broadly useful, so we invite practitioners to test our method and system.

## References

- [1] D. V. Lindley, “On a measure of information provided by an experiment,” Annals of Mathematical Statistics, vol. 27, no. 4, pp. 986–1005, 1956.
- [2] N. D. Goodman and A. Stuhlmüller, “The Design and Implementation of Probabilistic Programming Languages.” http://dippl.org, 2014. Accessed: 2016-5-2.
- [3] L. D. Goodfellow, “A psychological interpretation of the results of the Zenith radio experiments in telepathy,” Journal of Experimental Psychology, vol. 23, no. 6, pp. 601–623, 1938.
- [4] R. Falk, “The perception of randomness,” International Conference for the Psychology of Mathematics Education, vol. 1, pp. 222–229, 1981.
- [5] T. L. Griffiths and J. B. Tenenbaum, “From algorithmic to subjective randomness,” in Advances in Neural Information Processing Systems, 2004.
- [6] D. L. Medin and M. M. Schaffer, “Context Theory of Classification Learning,” Psychological Review, vol. 85, no. 3, pp. 207–238, 1978.
- [7] J. van Den Berg and A. Curtis, “Optimal nonlinear Bayesian experimental design: an application to amplitude versus offset experiments,” Geophysical Journal International, vol. 155, no. 2, pp. 411–421, 2003.
- [8] X. Huan, “Accelerated Bayesian experimental design for chemical kinetic models,” Master’s thesis, Massachusetts Institute of Technology, 2010.
- [9] J. Vanlier, C. A. Tiemann, P. A. J. Hilbers, and N. A. W. van Riel, “A Bayesian approach to targeted experiment design,” Bioinformatics, vol. 28, pp. 1136–1142, Apr. 2012.
- [10] J. Liepe, S. Filippi, M. Komorowski, and M. P. H. Stumpf, “Maximizing the Information Content of Experiments in Systems Biology,” PLoS Computational Biology, vol. 9, p. e1002888, Jan. 2013.
- [11] J. I. Myung and M. A. Pitt, “Optimal experimental design for model discrimination.,” Psychological review, vol. 116, no. 3, pp. 499–518, 2009.
- [12] D. Golovin, A. Krause, and D. Ray, “Near-optimal bayesian active learning with noisy observations,” in Advances in Neural Information Processing, 2010.