Autofocused oracles for model-based design

Autofocused oracles for model-based design


Data-driven design is making headway into a number of application areas, including protein, small-molecule, and materials engineering. The design goal is to construct an object with desired properties, such as a protein that binds to a therapeutic target, or a superconducting material with a higher critical temperature than previously observed. To that end, costly experimental measurements are being replaced with calls to high-capacity regression models trained on labeled data, which can be leveraged in an in silico search for design candidates. However, the design goal necessitates moving into regions of the design space beyond where such models were trained. Therefore, one can ask: should the regression model be altered as the design algorithm explores the design space, in the absence of new data? Herein, we answer this question in the affirmative. In particular, we (i) formalize the data-driven design problem as a non-zero-sum game, (ii) develop a principled strategy for retraining the regression model as the design algorithm proceeds—what we refer to as autofocusing, and (iii) demonstrate the promise of autofocusing empirically.

1 Oracle-based design

The design of objects with desired properties, such as novel proteins, molecules, or materials, has a rich history in bioengineering, chemistry, and materials science. In these domains, design has historically been performed through iterative, labor-intensive experimentation Arnold (2018) (e.g., measuring protein binding affinity) or compute-intensive physics simulations Hafner (2008) (e.g., computing low-energy structures for nanomaterials). Increasingly, however, attempts are being made to replace these costly and time-consuming steps with cheap and fast calls to a proxy regression model, trained on labeled data Yang et al. (2019); Wu et al. (2019); Bedbrook et al. (2019); Mansouri Tehrani et al. (2018); Biswas et al. (2020). Herein, we refer to such a proxy model as an oracle, and assume that acquisition of training data for the oracle is complete, as in Bedbrook et al. (2019); Mansouri Tehrani et al. (2018); Biswas et al. (2020); Popova et al. (2018); Gómez-Bombarelli et al. (2018).1 The key issue addressed by our work is how best to train an oracle for use in design, given fixed training data.

In contrast to the traditional use of predictive models, oracle-based design is distinguished by the fact that it seeks solutions—and therefore, will query the oracle—in regions of the design space that are not well-represented by the oracle training data. If this is not the case, the design problem is easy in that the solution is within the region of the training data. Furthermore, one does not know beforehand which parts of the design space a design procedure will navigate through. As such, a major challenge arises when an oracle is employed for design: its outputs, including its uncertainty estimates, become unreliable beyond the training data Nguyen et al. (2015); Brookes et al. (2019). Successful oracle-based design thus involves an inherent trade-off between the need to stay “near” the training data in order to trust the oracle, and the need to depart from it in order to make improvements. While trust region approaches have been developed to help address this trade-off Brookes et al. (2019); Biswas et al. (2020), herein, we take a different approach and ask: what is the most effective way to use a fixed, labeled dataset to train an oracle for design?


We develop a novel approach to oracle-based design that specifies how to update the oracle as the design space is explored—what we call autofocusing the oracle. In particular, we (i) formalize oracle-based design as a non-zero-sum game, (ii) derive an oracle-updating strategy for seeking a Nash equilibrium, and (iii) demonstrate empirically that autofocusing holds promise for improving oracle-based design.

2 Model-based optimization for design

Design problems can be cast as seeking points in the design space, , that with high probability satisfy desired conditions on a property random variable, . For example, one might want to design a superconducting material by specifying its chemical composition, , such that the resulting material has critical temperature greater than some threshold, , or has maximal critical temperature, . We specify the desired properties using a constraint set, , such as for some . The design goal is then to solve . This optimization problem over the inputs, , can be converted to one over distributions over the design space Zlochin et al. (2004); Brookes et al. (2020). Specifically, model-based optimization (MBO) seeks the parameters, , of a “search model”, , that maximizes an objective that bounds the original objective:


The original optimization problem over , and the MBO problem over , are equivalent when the search model has the capacity to place point masses on optima of the original objective. Reasons for using the MBO formulation include that it requires no gradients of , thereby allowing the use of arbitrary oracles for design, including those that are not differentiable with respect to the design space and otherwise require specialized treatment Linder and Seelig (2020). MBO also naturally allows one to obtain not just a single design candidate, but a diverse set of candidates, by sampling from the final search distribution (whose entropy can be adjusted by adding regularization to Equation 1). Finally, MBO introduces the language of probability into the optimization, thereby allowing coherent incorporation of probabilistic constraints such as implicit trust regions Brookes et al. (2019). The search model can be any parameterized probability distribution that can be sampled from, and whose parameters can be estimated using weighted maximum likelihood estimation (MLE) or approximations thereof. Examples include mixtures of Gaussians, hidden Markov models, variational autoencoders Kingma and Welling (2014), and Potts models Marks et al. (2011). Notably, the search model distribution can be over discrete or continuous random variables, or a combination thereof.

We use the phrase model-based design (MBD) to denote use of MBO to solve a design problem. Hereafter, we focus on oracle-based MBD, which attempts to solve Equation 1 by replacing costly and time-consuming queries of the ground truth2, , with calls to a trained regression model (i.e., oracle), , with parameters, . Given access to a fixed dataset, , the oracle is typically trained once using standard techniques and thereafter considered fixed Bedbrook et al. (2019); Mansouri Tehrani et al. (2018); Biswas et al. (2020); Popova et al. (2018); Gómez-Bombarelli et al. (2018); Brookes et al. (2019); Gupta and Zou (2019); Killoran et al. (2017). In what follows, we describe why such a strategy is sub-optimal and how to re-train the oracle in order to better achieve design goals. First, however, we briefly review a common approach for performing MBO, as we will leverage such algorithms in our approach.

2.1 Solving model-based optimization problems

MBO problems are often tackled with an Estimation of Distribution Algorithm (EDA) Bengoetxea et al. (2001); Baluja and Caruana (1995), a class of iterative optimization algorithms that can be seen as Monte Carlo expectation-maximization Brookes et al. (2020); EDAs are also connected to the cross-entropy method Rubinstein (1997, 1999) and reward-weighted regression in reinforcement learning Peters and Schaal (2007). Given an oracle, , and an initial search model, , an EDA typically proceeds at iteration with two core steps:

  1. “E-step”: Sample from the current search model, for all . Compute a weight for each sample, , where is a method-specific, monotonic transformation.

  2. “M-step”: Perform weighted MLE to yield an updated search model, , which tends to have more mass where is high. (Some EDAs can be seen as performing maximum a posteriori inference instead, which results in smoothed parameter updates Brookes et al. (2019).)

Upon convergence of the EDA, design candidates can be sampled from the final search model if it is not a point mass; one may also choose to use promising samples from earlier iterations. Notably, the oracle, , remains fixed in the steps above. Next, we motivate a new formalism for oracle-based MBD that yields a principled approach for updating the oracle at each iteration.

3 Autofocused oracles for model-based design

The common approach of substituting the oracle, , for the ground-truth, , does not address the fact that the oracle is only likely to be reliable over the distribution from which its training data were drawn Nguyen et al. (2015); Sugiyama et al. (2007); Quiñonero-Candela et al. (2008). To address this problem, we now reformulate the MBD problem as a non-zero-sum game, which suggests an algorithmic strategy for iteratively updating the oracle within any MBO algorithm.

3.1 Model-based design as a game

When the objective in Equation 1 is replaced with an oracle-based version,


the solution to the oracle-based problem will, in general, be sub-optimal with respect to the original objective that uses the ground truth, . This sub-optimality can be extreme due to pathological behavior of the oracle when the search model, , strays too far from the training distribution during the optimization Brookes et al. (2019).

Since one cannot access the ground truth, we seek a practical alternative wherein we can leverage an oracle, but also infer when the values of the ground-truth and oracle-based objectives (in Equations 1 and 2, respectively) are likely to be close. To do so, we introduce the notion of the oracle gap, defined as . When this quantity is small, then by Jensen’s inequality the oracle-based and ground-truth objectives are close. Consequently, our insight for improving oracle-based design is to use the oracle that minimizes the oracle gap,


Together, Equations 2 and 3 define the coupled objectives of two players, namely the search model (with parameters ) and the oracle (with parameters ), in a non-zero-sum game. To attain good objective values for both players, our goal will be to search for a Nash equilibrium—that is, a pair of values such that neither can improve its objective given the other. To do so, we develop an alternating ascent-descent algorithm, which alternates between (i) fixing the oracle parameters and updating the search model parameters to increase the objective in Equation 2 (the ascent step), and (ii) fixing the search model parameters and updating the oracle parameters to decrease the objective in Equation 3 (the descent step). In the next section, we describe this algorithm in more detail.

Practical interpretation of the MBD game.

Interpreting the usefulness of this game formulation requires some subtlety. The claim is not that every Nash equilibrium yields a search model that provides a high value of the (unknowable) ground-truth objective in Equation 1. However, for any pair of values, , the value of the oracle gap provides a certificate on the value of the ground-truth objective. In particular, if one has an oracle and search model that yield an oracle gap of , then by Jensen’s inequality the ground-truth objective is within of the oracle-based objective. Therefore, to the extent that we are able to minimize the oracle gap (Equation 3), we can trust the value of our oracle-based objective (Equation 2). Note that a small, or even zero oracle gap only implies that the oracle-based objective is trustworthy; successful design also entails achieving a high oracle-based objective, the potential for which depends on an appropriate oracle class and suitably informative training data (as it always does for oracle-based design, regardless of whether our framework is used).

Although the oracle gap as a certificate is useful conceptually for motivating our approach, at present it is not clear how to estimate it. In our experiments, we found that we could demonstrate the benefits of autofocusing without directly estimating the oracle gap, relying solely on the principle of minimizing it. We also note that in practice, what matters is not whether we converge to a Nash equilibrium, just as what matters in empirical risk minimization is not whether one exactly recovers the global optimum, only a useful point. That is, if we can find parameters, , that yield better designs than alternative methods, then we have developed a useful method.

3.2 An alternating ascent-descent algorithm for the MBD game

Our approach alternates between an ascent step that updates the search model, and a descent step that updates the oracle. The ascent step is relatively straightforward as it leverages existing MBO algorithms. The descent step, however, requires some creativity. In particular, for the ascent step, we run a single iteration of an MBO algorithm as described in section 2.1, to obtain a search model that increases the objective in Equation 2. For the descent step, we aim to minimize the oracle gap in Equation 3 by making use of the following observation (proof in Supplementary Material section S2).

Proposition 1.

For any search model, , if the oracle parameters, , satisfy


where is the Kullback-Leibler (KL) divergence between distributions and , then the following bound holds:


As a consequence of Proposition 1, given any search model, , an oracle that minimizes the expected KL divergence in Equation 4 also minimizes an upper bound on the oracle gap. Our descent strategy is therefore to minimize this expected divergence. In particular, as shown in the Supplementary Material section S2, the resulting oracle parameter update at iteration can be written as , where we refer to the objective as the log-likelihood under the search model. Although we cannot generally access the ground truth, , we do have labeled training data, , whose labels come from the ground-truth distribution, . We therefore use importance sampling with the training distribution, , as the proposal distribution, to obtain a now-practical oracle parameter update,


The training points, , are used to estimate some model for , while is given by the search model. We discuss the variance of the importance weights, , shortly.

Together, the ascent and descent steps amount to appending a “Step 3” to each iteration of the generic two-step MBO algorithm outlined in section 2.1, in which the oracle is retrained on re-weighted training data according to Equation 6. We call this strategy autofocusing the oracle, as it retrains the oracle in lockstep with the search model, to keep the oracle likelihood maximized on the most promising regions of the design space. Pseudo-code for autofocusing can be found in the Supplementary Material (Algorithms 1 and 2). As shown in the experiments, autofocusing tends to improve the outcomes of design procedures, and when it does not, no harm is incurred relative to the naive approach with a fixed oracle. Before discussing such experiments, we first make some remarks.

3.3 Remarks on autofocusing

Controlling variance of the importance weights.

It is well known that importance weights can have high, even infinite, variance Owen (2013), which may prevent the importance-sampled estimate of the log-likelihood from being useful for retraining the oracle effectively. That is, solving Equation 6 may not reliably yield oracle parameter estimates that minimize the log-likelihood under the search model. To monitor the reliability of the importance-sampled estimate, one can compute and track an effective sample size of the re-weighted training data, , which reflects the variance of the importance weights Owen (2013). If one has some sense of a suitable sample size for the application at hand (e.g., based on the oracle model capacity), then one could monitor and choose not to retrain when it is too small. Another variance control strategy is to use a trust region to constrain the movement of the search model, such as in Brookes et al. (2019), which automatically controls the variance (see Supplementary Material Proposition S2.2). Indeed, our experiments show how autofocusing works synergistically with a trust-region approach. Finally, two other common strategies are: (i) self-normalizing the weights, which provides a biased but consistent and lower-variance estimate Owen (2013), and (ii) flattening the weights Sugiyama et al. (2007) to according to a hyperparameter, . The value of interpolates between the original importance weights (), which provide an unbiased but high-variance estimate, and all weights equal to one (), which is equivalent to naively training the oracle (i.e., no autofocusing).

Oracle bias-variance trade-off.

If the oracle equals the ground truth over all parts of the design space encountered during the design procedure, then autofocusing should not improve upon using a fixed oracle. In practice, however, this is unlikely to ever be the case—the oracle is almost certain to be misspecified and ultimately mislead the design procedure with incorrect inductive bias. It is therefore interesting to consider what autofocusing does from the perspective of the bias-variance trade-off of the oracle, with respect to the search model distribution. On the one hand, autofocusing retrains the oracle using an unbiased estimate of the log-likelihood over the search model. On the other hand, as the search model moves further away from the training data, the effective sample size available to train the oracle decreases; correspondingly, the variance of the oracle increases. In other words, when we use a fixed oracle (no autofocusing), we prioritize minimal variance at the expense of greater bias. With pure autofocusing, we prioritize reduction in bias at the expense of higher variance. Autofocusing with techniques to control the variance of the importance weights Sugiyama et al. (2007, 2012) enables us to make a suitable trade-off between these two extremes.

Autofocusing corrects design-induced covariate shift.

In adopting an importance-sampled estimate of the training objective, Equation 6 is analogous to the classic covariate shift adaptation strategy known as importance-weighted empirical risk minimization Sugiyama et al. (2007, 2012). We can therefore interpret autofocusing as dynamically correcting for covariate shift induced by a design procedure, where, at each iteration, a new “test” distribution is given by the updated search model. Furthermore, we are in the fortunate position of knowing the exact parametric form of the test density at each iteration, which is simply that of the search model. This view highlights that the goal of autofocusing is not necessarily to increase exploration of the design space, but to provide a more useful oracle wherever the search model does move (as dictated by the underlying method to which autofocusing is added).

4 Related Work

Although there is no cohesive literature on oracle-based design in the fixed-data setting, its use is gaining prominence in several application areas, including the design of proteins and nucleotide sequences Biswas et al. (2020); Brookes et al. (2019); Gupta and Zou (2019); Killoran et al. (2017); Kumar and Levine (2019), molecules Olivecrona et al. (2017); Popova et al. (2018); Gómez-Bombarelli et al. (2018), and materials Mansouri Tehrani et al. (2018); Hautier et al. (2010). Within such work, the danger in extrapolating beyond the training distribution is not always acknowledged or addressed. In fact, proposed design procedures often are validated under the assumption that the oracle is always correct Popova et al. (2018); Linder and Seelig (2020); Gupta and Zou (2019); Killoran et al. (2017); Olivecrona et al. (2017). Some exceptions include Conditioning by Adaptive Sampling (CbAS) Brookes et al. (2019), which employs a probabilistic trust-region approach using a model of the training distribution, and Biswas et al. (2020), which uses a hard distance-based threshold. Similar in spirit to Brookes et al. (2019), Linder et al. regularize the designed sequences based on their likelihood under a model of the training distribution Linder et al. (2020). In another approach, a variational autoencoder implicitly enforces a trust region by constraining design candidates to the probabilistic image of the decoder Gómez-Bombarelli et al. (2018). Finally, Kumar & Levine tackle design by learning the inverse of a ground-truth function, which they constrain to agree with an oracle, so as to discourage too much departure from the training data Kumar and Levine (2019). None of these approaches update the oracle in any way. However, autofocusing is entirely complementary to and does not preclude the additional use of any of these approaches. For example, we demonstrate in our experiments that autofocusing improves the outcomes of CbAS, which implicitly inhibits the movement of the search model away from the training distribution.

Related to the design problem is that of active learning in order to optimize a function, using for example Bayesian optimization Snoek et al. (2012). Such approaches are fundamentally distinct from our setting in that they dynamically acquire new labeled data, thereby more readily allowing for correction of oracle modeling errors. In a similar spirit, evolutionary algorithms sometimes use a “surrogate” model of the function of interest (equivalent to our oracle), to help guide the acquisition of new data Jin (2011). In such settings, the surrogate may be updated using an ad hoc subset of the data Le et al. (2013) or perturbation of the surrogate parameters Schmidt and Lipson (2008). Similarly, a recent reinforcement-learning based approach to biological sequence design relies on new data to refine the oracle when moving into a region of design space where the oracle is unreliable Angermueller et al. (2019).

Offline reinforcement learning (RL) Levine et al. (2020) shares similar characteristics with our problem in that the goal is to find a policy that optimizes a reward function, given only a fixed dataset of trajectories sampled using another policy. In particular, offline model-based RL leverages a learned model of dynamics that may not be accurate everywhere. Methods in that setting have attempted to account for the shift away from the training distribution using uncertainty estimation and trust-region approaches Chua et al. (2018); Deisenroth and Rasmussen (2011); Rhinehart et al. (2020); importance sampling has also been used for off-policy evaluation Precup et al. (2001); Thomas and Brunskill (2016).

As noted in the previous section, autofocusing operates through iterative retraining of the oracle in order to correct for covariate shift induced by the movement of the search model. It can therefore be connected to ideas from domain adaptation more broadly Quiñonero-Candela et al. (2008). Finally, we note that mathematically, oracle-based MBD is related to the decision-theoretic framework of performative prediction Perdomo et al. (2020). Perdomo et al. formalize the phenomenon in which using predictive models to perform actions induces distributional shift, then present theoretical analysis of repeated retraining with new data as a solution. Our problem has two major distinctions from this setting: first, the ultimate goal in design is to maximize an unknowable ground-truth objective, not to minimize risk of the oracle. The latter is only relevant to the extent that it helps us achieve the former, and our work operationalizes that connection by formulating and minimizing the oracle gap. Second, we are in a fixed-data setting. Our work demonstrates the utility of adaptive retraining even in the absence of new data.

5 Experiments

We now demonstrate empirically, across a variety of both experimental settings and MBO algorithms, how autofocusing can help us better achieve design goals. First we leverage an intuitive example to gain detailed insights into how autofocus behaves. We then conduct a detailed study on a more realistic problem of designing superconducting materials. Code for our experiments is available at

5.1 An illustrative example

Figure 1: Illustrative example. Panels (a-d) show detailed snapshots of the MBO algorithm, CbAS Brookes et al. (2019), with and without autofocusing (AF) in each panel. The vertical axis represents both values (for the oracle and ground truth) and probability density values (of the training distribution, , and search distributions, ). Shaded envelopes correspond to standard deviation of the oracles, , with the oracle expectations, , shown as a solid line. Specifically, (a) at initialization, the oracle and search model are the same for AF and non-AF. Intermediate and final iterations are shown in (b-d), where the non-AF and AF oracles and search models increasingly diverge. Greyscale of training points corresponds to their importance weights used for autofocusing. In (d), each star and dotted horizontal line indicate the ground-truth value corresponding to the point of maximum density, indicative of the quality of the final search model (higher is better). The values of used here correspond to the ones marked by an in Figure 2, which summarizes results across a range of settings. Panels (e,f) show the search model over all iterations without and with autofocusing, respectively.
Figure 2: Improvement from autofocusing (AF) over a wide range of settings of the illustrative example. Each colored square shows the improvement (averaged over trials) conferred by AF for one setting, , of, respectively, the standard deviations of the training distribution and the label noise. Improvement is quantified as the difference between the ground-truth objective in Equation 1 achieved by the final search model with and without AF. A positive value means AF yielded higher ground-truth values (i.e., performed better than without AF), while zero means it neither helped nor hurt. Similar plots to Figure 1 are shown in the Supplementary Material for other settings (Figure S1).

To investigate how autofocusing works in a setting that can be understood intuitively, we constructed a one-dimensional design problem where the goal was to maximize a multi-modal ground-truth function, , given fixed training data (Figure 1a). The training distribution from which training points were drawn, , was a Gaussian with variance, , centered at , a point where is small relative to the global maximum at . This captures the common scenario where the oracle training data do not extend out to global optima of the property of interest. As we increase the variance of the training distribution, , the training data become more and more likely to approach the global maximum of . The training labels are drawn from , where is the variance of the label noise. For this example, we used CbAS Brookes et al. (2019), an MBO algorithm that employs a probabilistic trust region. We did not control the variance of the importance weights.

An MBO algorithm prescribes a sequence of search models as the optimization proceeds, where each successive search model is fit using weighted MLE to samples from its predecessor. However, in our one-dimensional example, one can instead use numerical quadrature to directly compute each successive search model Brookes et al. (2019). Such an approach enables us to abstract out the particular parametric form of the search model, thereby more directly exposing the effects of autofocusing. In particular, we used numerical quadrature to compute the search model density at iteration as , where belongs to a sequence of relaxed constraint sets such that  Brookes et al. (2019). We computed this sequence of search models in two ways: (i) without autofocusing, that is, with a fixed oracle trained once on equally weighted training data, and (ii) with autofocusing, that is, where the oracle was retrained at each iteration. In both cases, the oracle was of the form , where was fit by kernel ridge regression with a radial basis function kernel and was set to the mean squared error between and the labels (see Supplementary Material section S3 for more details). Since this was a maximization problem, the desired condition was set as (where for the initial oracle). We found that autofocusing more effectively shifts the search model toward the ground-truth global maximum as the iterations proceed (Figure 1b-f), thereby providing improved design candidates.

To understand the effect of autofocusing more systematically, we repeated the experiment just described across a range of settings of the variances of the training distribution, , and of the label noise, (Figure 2). Intuitively, both these variances control how informative the training data are about the ground-truth global maximum: as increases, the training data are more likely to include points near the global maximum, and as decreases, the training labels are less noisy. Therefore, if the training data are either too uninformative (small and/or large ) or too informative (large and/or small ), then one would not expect autofocusing to substantially improve design. In intermediate regimes, autofocusing should be particularly useful. Such a phenomenon is seen in our experiments (Figure 2). Importantly, this kind of intermediate regime is one in which practitioners are likely to find themselves: the motivation for design is often sparked by the existence of a few examples with property values that are exceptional compared to most known examples, yet the design goal is to push the desired property to be more exceptional still. In contrast, if the true global optimum already resides in the training data, one cannot hope to design anything better anyway. However, even in regimes where autofocusing does not help, on average it does not hurt relative to a naive approach with a fixed oracle (Figure 2 and Supplementary Material section 5.1).

5.2 Designing superconductors with maximal critical temperature

Designing superconducting materials with high critical temperatures is an active research problem that impacts engineering applications from magnetic resonance imaging systems to the Large Hadron Collider. To assess autofocusing in a more realistic scenario, we used a dataset comprising superconducting materials paired with their critical temperatures Hamidieh (2018), the maximum temperature at which a material exhibits superconductivity. Each material is represented by a feature vector of length eighty-one, which contains real-valued properties of the material’s constituent elements (e.g., their atomic radius and valence). We outline our experiments here, with details deferred to the Supplementary Material section S4.

Median Max PCI RMSE Median Max PCI RMSE
Original 51.5 103.8 0.11 0.05 17.2 57.0 98.4 0.11 0.01 29.6
Autofocused 76.4 119.8 3.78 0.56 12.9 78.9 111.6 4.4 0.01 24.5
Mean Diff. 24.9** 16.0** 3.67** 0.51** -4.4** 21.9** 13.2** 4.2** 0.01 -5.1*
Original 43.4 102.0 0.05 0.92 7.4 49.2 100.6 0.14 0.09 17.5
Autofocused 71.4 114.0 1.60 0.65 12.7 64.2 111.6 0.86 0.49 11.1
Mean Diff. 28.0** 12.0** 1.56** -0.27** 5.4** 15.0** 11.0** 0.73** 0.40** -6.4**
Original 34.5 55.8 0.00 -0.16 148.3 42.1 69.4 0.00 0.27 27493.2
Autofocused 67.0 98.0 1.69 0.13 29.4 50.2 85.8 0.01 0.52 9499.8
Mean Diff. 32.6** 42.3* 1.69* 0.29 -118.9** 8.1* 16.3* 0.01 0.25* -17993.5*
Table 1: Designing superconducting materials. We ran six different MBO methods, each with and without autofocusing. For each method, we extracted those samples with oracle expectations above the percentile and computed their ground-truth expectations. We report the median and maximum of those ground-truth expectations (both in degrees K), their percent chance of improvement (PCI, in percent) over the maximum label in the training data, as well as the Spearman correlation () and root mean squared error (RMSE, in degrees K) between the oracle and ground-truth expectations. Each reported score is averaged over trials, where, in each trial, a different training set was sampled from the training distribution. “Mean Diff.” is the average difference between the score when using autofocusing compared to not. Bold values with one star (*) and two stars (**), respectively, mean -values and from a two-sided Wilcoxon signed-rank test on the paired score differences between a method with autofocus and without (’Original’). For all scores but RMSE, a higher value means autofocusing yielded better results (as indicated by the arrow ); for RMSE, the opposite is true (as indicated by the arrow ).

Unlike in silico validation of a predictive model, one cannot hold out data to validate a design algorithm because one will not have ground-truth labels for proposed design candidates. Thus, similarly to Brookes et al. (2019), we created a “ground-truth” model by training gradient-boosted regression trees Hamidieh (2018); Stanev et al. (2018) on the whole dataset and treating the output as the ground-truth expectation, , which can be called at any time. Next, we generated training data to emulate the common scenario in which design practitioners have labeled data that are not dense near ground-truth global optima. In particular, we selected the training points from the dataset whose ground-truth expectations were in the bottom percentile. We used MLE with these points to fit a full-rank multivariate normal, which served as the training distribution, , from which we drew simulated training points, . For each we drew one sample, , to obtain a noisy ground-truth label. Finally, for our oracle, we used to train an ensemble of three neural networks that output both and , to provide predictions of the form  Lakshminarayanan et al. (2017).

We ran six different MBO algorithms, each with and without autofocusing, with the goal of designing materials with maximal critical temperatures. In all cases, we used a full-rank multivariate normal for the search model, and flattened the importance weights used for autofocusing to Sugiyama et al. (2007) with to help control variance. The MBO algorithms were: (i) Conditioning by Adaptive Sampling (CbAS) Brookes et al. (2019); (ii) Design by Adaptive Sampling (DbAS) Brookes et al. (2019); (iii) reward-weighted regression (RWR) Peters and Schaal (2007); (iv) the “feedback” mechanism proposed in Gupta and Zou (2019) (FB); (v) the cross-entropy method used to optimize probability of improvement (CEM-PI) Brookes et al. (2019); Snoek et al. (2012); and (vi) Covariance Matrix Adaptation Evolution Strategy (CMA-ES) Hansen (2006). These are briefly described in the Supplementary Material section S4.

To quantify the success of each algorithm, we did the following. At each iteration, , we first computed the oracle expectations, , for each of samples drawn from the search model, . We then selected the iteration where the percentile of these oracle expectations was greatest. For that iteration, we computed various summary statistics on the ground-truth expectations of the best samples, as judged by the oracle from that iteration (i.e., samples with oracle expectations greater than the percentile; Table 1). See Algorithm 3 in the Supplementary Material for pseudocode of this procedure. Our evaluation procedure emulates the typical setting in which a practitioner has limited experimental resources, and can only evaluate the ground truth for the most promising candidates Wu et al. (2019); Bedbrook et al. (2019); Mansouri Tehrani et al. (2018); Biswas et al. (2020).

Across the majority of evaluation metrics, for all MBO methods, autofocusing a method provided a statistically significant improvement upon the original method. The percent chances of improvement (PCI, the percent chances that the best samples had greater ground-truth expectations than the maximum label in the training data), expose the challenging nature of the design problem. All methods with no autofocusing had a PCI less than , which although small, still reflects a marked improvement over a naive baseline of simply drawing new samples from the training distribution itself, which achieves . Plots of design trajectories from these experiments, and results from experiments without variance control and with oracle architectures of higher and lower capacities, can be found in the Supplementary Material (Figures S3 and S4, Table S2).

6 Discussion

We have introduced a new formulation of oracle-based design as a non-zero-sum game. From this formulation, we developed a new approach for design wherein the oracle—the predictive model that replaces costly and time-consuming laboratory experiments—is iteratively retrained so as to “autofocus” it on the current region of design candidates under consideration. Our autofocusing approach can be applied to any design procedure that uses model-based optimization. We recommend using autofocusing with an MBO method that uses trust regions, such as CbAS Brookes et al. (2019), which automatically helps control the variance of the importance weights used for autofocusing. For autofocusing an MBO algorithm without a trust region, practical use of the oracle gap certificate and/or effective sample size should be further investigated. Nevertheless, even without these, we have demonstrated empirically that autofocusing can provide benefits.

Autofocusing can be seen as dynamically correcting for covariate shift as the design procedure explores design space. It can also be understood as enabling a design procedure to navigate a trade-off between the bias and variance of the oracle, with respect to the search model distribution. One extension of this idea is to also perform oracle model selection at each iteration, such that the model capacity is tailored to the level of importance weight variance.

Further extensions to consider are alternate strategies for estimating the importance weights Sugiyama et al. (2012). In particular, training discriminative classifiers to estimate these weights may be fruitful when using search models that are implicit generative models, or whose likelihood cannot otherwise be computed in closed form, such as variational autoencoders Sugiyama et al. (2012); Grover et al. (2019). We believe this may be a promising approach for applying autofocusing to biological sequence design and other discrete design problems, which often leverage such models. One can also imagine extensions of autofocusing to gradient-based design procedures Killoran et al. (2017)—for example, using techniques for test-time oracle retraining, in order to evaluate the current point most accurately Sun et al. (2019).


Many thanks to Sebastián Prillo, David Brookes, Chloe Hsu, Hunter Nisonoff, Akosua Busia, and Sergey Levine for helpful comments on the work. We are also grateful to the U.S. National Science Foundation Graduate Research Fellowship Program and the Chan Zuckerberg Investigator Program for funding.

7 Broader Impact

If adopted more broadly, our work could affect how novel proteins, small molecules, materials, and other entities are engineered. Because predictive models are imperfect, even with the advances presented herein, care should be taken by practitioners to verify that any proposed design candidates are indeed safe and ethical for the intended downstream applications. The machine learning approach we present facilitates obtaining promising design candidates in a cost-effective manner, but practitioners must follow up on candidates proposed by our approach with conventional laboratory methods, as appropriate to the application domain.


Supplementary Material


S1 Pseudocode

Algorithm 1 gives pseudocode for autofocusing a broad class of model-based optimization (MBO) algorithms known as estimation of distribution algorithms (EDAs), which can be seen as performing Monte-Carlo expectation-maximization Brookes et al. (2020). EDAs proceed at each iteration with a sampling-based “E-step” (Steps 1 and 2 in Algorithm 1) and a weighted maximum likelihood estimation (MLE) “M-step” (Step 3; see Brookes et al. (2020) for more details). Different EDAs are distinguished by method-specific monotonic transformations , which determine the sample weights used in the M-step. In some EDAs, this transformation is not explicitly defined, but rather implicitly implemented by constructing and using a sequence of relaxed constraint sets, , such that  Rubinstein (1997, 1999); Brookes et al. (2019).

Algorithm 2 gives pseudocode for autofocusing a particular EDA, Conditioning by Adaptive Sampling (CbAS) Brookes et al. (2019), which uses such a sequence of relaxed constraint sets, as well as M-step weights that induce an implicit trust region for the search model update. For simplicity, the algorithm is instantiated with the design goal of maximizing the property of interest. It can easily be generalized to the goal of achieving a specific value for the property, or handling multiple properties (see Sections S2-3 of Brookes et al. (2019)).

Use of [.] in the pseudocode denotes an optional input argument with default values.

Input :  Training data, ; oracle model class, with parameters, , that can be estimated with MLE; search model class, , with parameters, , that can be estimated with weighted MLE or approximations thereof; desired constraint set, (e.g., ); maximum number of iterations, ; number of samples to generate, ; EDA-specific monotonic transformation, .
Initialization :  Obtain by fitting to with the search model class. For the search model, set . For the oracle, , use MLE with equally weighted training data.
       for  do
  • 1.


    Sample from the current search model, .

  • 2.



  • 3.


    Fit the updated search model, , using weighted MLE with the samples, , and their corresponding EDA weights, .

  • 4.


    Compute importance weights for the training data, .

  • 5.


    Retrain the oracle using the re-weighted training data,

    Output : Sequence of search models, , and sequence of samples, , from all iterations. One may use these in a number of different ways. For example, one may sample design candidates from the final search model, , or use the most promising candidates among , as judged by the appropriate oracle (i.e., corresponding to the iteration at which a candidate was generated).
    Algorithm 1 Autofocused model-based optimization algorithm
    Input :  Training data, ; oracle model class, with parameters, , that can be estimated with MLE; search model class, , with parameters, , that can be estimated with weighted MLE or approximations thereof; maximum number of iterations, ; number of samples to generate, ; [percentile threshold, ].
    Initialization :  Obtain by fitting to with the search model class. For the search model, set . For the oracle, , use MLE with equally weighted training data. Set .
           for  do
  • 1.


    Sample from the current search model, .

  • 2.


  • 3.


  • 4.


  • 5.


    Fit the updated search model, , using weighted MLE with the samples, , and their corresponding EDA weights, .

  • 6.


    Compute importance weights for the training data, .

  • 7.


    Retrain the oracle using the re-weighted training data,

    Output : Sequence of search models, , and sequence of samples, , from all iterations. One may use these in a number of different ways (see Algorithm 1).
    Algorithm 2 Autofocused Conditioning by Adaptive Sampling (CbAS)
    Input :  Sequence of samples, , from each iteration of an MBO algorithm; their oracle expectations, ; [percentile threshold, ].
           for  do
                 Compute and store .
           (pick the best iteration) (pick best samples from best iteration)
    Output : 
    Algorithm 3 Procedure for evaluating MBO algorithms in superconductivity experiments. For each MBO algorithm in Tables 1, S2, S3, and S4, the reported scores were the outputs of this procedure, averaged over trials. Recall that denotes the expectation of the oracle model at iteration , while denotes the ground-truth expectation.

    S2 Proofs, derivations, and supplementary results

    Proof of Proposition 1..

    For any distribution , if


    then it holds that


    where is the total variation distance between probability distributions and , and the second inequality is due to Pinsker’s inequality. Finally, applying Jensen’s inequality yields


    s2.1 Derivation of the descent step to minimize the oracle gap

    Here, we derive the descent step of the alternating ascent-descent algorithm described in section 3.2. At iteration , given the search model parameters, , our goal is to update the oracle parameters according to


    Note that


    We cannot query the ground truth, , but we do have labeled training data, , where and by definition. We therefore leverage importance sampling, using as the proposal distribution, to obtain


    Finally, we instantiate an importance sampling estimate of the objective in Equation 16 with our labeled training data, to get a practical oracle parameter update,


    This update is equivalent to fitting the oracle parameters, , by performing weighted MLE with the labeled training data, , and corresponding weights, , where .

    s2.2 Variance of importance weights

    The importance-sampled estimate of the log-likelihood used to retrain the oracle (Equation 17) is unbiased, but may have high variance due to the variance of the importance weights. To assess the reliability of the importance-sampled estimate, alongside the effective sample size described in section 3.3, one can also monitor confidence intervals on some loss of interest. Let denote a pertinent loss function induced by the oracle parameters, , (e.g., the squared error ). The following observation is due to Chebyshev’s inequality.

    Proposition S2.1.

    Suppose that is a bounded loss function, such that for all , and that . Let be labeled training data such that the are drawn independently and for each . For any and any , with probability at least it holds that


    where is the exponentiated Rényi- divergence, i.e., .


    We use the following lemma to bound the variance of the importance sampling estimate of the loss. Chebyshev’s inequality then yields the desired result. ∎

    Lemma S2.1 (Adaptation of Lemma 4.1 in Metelli et al. (2018) Metelli et al. (2018)).

    Under the same assumptions as Proposition S2.1, the joint distribution is absolutely continuous with respect to the joint distribution . Then for any , it holds that


    One can use Proposition S2.1 to construct a confidence interval on, for example, the expected squared error between the oracle and the ground-truth values with respect to , using the labeled training data on hand. The Rényi divergence can be estimated using, for example, the plug-in estimate . While the bound, , on may be restrictive in general, for any given application one may be able to use domain-specific knowledge to estimate . For example, in designing superconducting materials with maximized critical temperature, one can use an oracle architecture whose outputs are non-negative and at most some plausible maximum value (in degrees Kelvin) according to superconductivity theory; one could then take for squared error loss. Computing a confidence interval at each iteration of a design procedure then allows one to monitor the error of the retrained oracle.

    Monitoring such confidence intervals, or the effective sample size, is most likely to be useful for design procedures that do not have in-built mechanisms for restricting the movement of the search distribution away from the training distribution. Various algorithmic interventions are possible—one could simply terminate the procedure if the error bounds, or effective sample size, surpass some threshold, or one could decide not to retrain the oracle for that iteration. For simplicity and clarity of exposition, we did not use any such interventions in this paper, but we mention them as potential avenues for further improving autofocusing in practice. Note that 1) the bound in Proposition S2.1 is only useful if the importance weight variance is finite, and 2) estimating the bound itself requires use of the importance weights, and thus may also be susceptible to high variance. It may therefore be advantageous to use a liberal criterion for any interventions.

    CbAS naturally controls the importance weight variance.

    Design procedures that leverage a trust region can naturally bound the variance of the importance weights. For instance, CbAS Brookes et al. (2019), developed in the context of an oracle with fixed parameters, , proposes estimating the training distribution conditioned on as the search model:


    where . This prescribed search model yields the following importance weight variance.

    Proposition S2.2.

    For , it holds that


    That is, so long as has non-neglible mass under data drawn from the training distribution, , we have reasonable control on the variance of the importance weights. Of course, if is too small, this bound is not useful, but to have any hope of successful data-driven design it is only reasonable to expect this quantity to be non-negligible. This is similar to the experimental requirement, in directed evolution for protein design, that the initial data exhibit some “minimal functionality” with regards to the property of interest Yang et al. (2019).


    The variance of the importance weights can be written as


    where is the exponentiated Rényi- divergence. Then we have


    where the second equality is due to the property in Example 1 of van Erven and Harremos (2014). ∎

    This variance yields the following expression for the population version of the effective sample size:


    Furthermore, CbAS proposes an iterative procedure to estimate . At iteration , the algorithm seeks a variational approximation to , where . Since , the expressions above for the importance weight variance and effective sample size for the final search model prescribed by CbAS translate into upper and lower bounds, respectively, on the importance weight variance and effective sample size for the distributions prescribed at each iteration.

    S3 An illustrative example

    s3.1 Experimental details

    Ground truth and oracle.

    For the ground-truth function , we used the sum of the densities of two Gaussian distributions, and . For the expectation of the oracle model, , we used kernel ridge regression with a radial basis function kernel as implemented in scikit-learn, with the default values for all hyperparameters. The variance of the oracle model, , was set to the mean squared error between and the training data labels, as estimated with -fold importance-weighted cross-validation when autofocusing Sugiyama et al. (2007).

    MBO algorithm.

    We used CbAS as follows. At iteration , similar to Brookes et al. (2019), we used the relaxed constraint set where was the percentile of the oracle expectation, , when evaluated over . At the final iteration, , the constraint set is equivalent to the design goal of maximizing the oracle expectation, , which is the oracle-based proxy to maximizing the ground-truth function, . At each iteration, we used numerical quadrature (scipy.integrate.quad) to compute the search model,


    Numerical integration enabled us to use CbAS without a parametric search model, which otherwise would have been used to find a variational approximation to this distribution Brookes et al. (2019). We also used numerical integration to compute the value of the model-based design objective (Equation 1) achieved by the final search model, both with and without autofocusing.

    s3.2 Additional plots and discussion

    For all tested settings of the variance of the training distribution, , and the variance of the label noise, , autofocusing yielded positive improvement to the model-based design objective (Equation 1) on average over trials (Figure 2). For a more comprehensive understanding of the effects of autofocusing, here we pinpoint specific trials where autofocusing decreased the objective, compared to a naive approach with a fixed oracle. Such trials were rare, and occurred in regimes where one would not reasonably expect autofocusing to provide a benefit. In particular, as discussed in section 5.1, such regimes include when is too small, such that training data are unlikely to be close to the global maximum, and when is too large, such that the training data already include points around the global maximum and a fixed oracle should be suitable for successful design. Similarly, when the label noise variance, , is too large, the training data are no longer informative and no procedure should hope to perform well systematically. We now walk through each of these regimes.

    When was small and there was no label noise, we observed a few trials where the final search model placed less mass under the global maximum with autofocusing than without. This effect was due to increased standard deviation of the autofocused oracle, induced by high variance of the importance weights (Figure 0(a)). When was small and was extremely large, a few trials yielded lower final objectives with autofocusing by insignificant margins; in such cases, the label noise was overwhelming enough that the search model did not move much anyway, either with or without autofocusing (Figure 0(b)). Similarly, when was large and there was no label noise, a few trials yielded lower final objectives with autofocusing than without, by insignificant margins (Figure 0(c)).

    (a) Example trial with low-variance training distribution and no label noise, .
    (b) Example trial with low-variance training distribution and high label noise, .
    (c) Example trial with high-variance training distribution and no label noise .
    (d) Example trial with high-variance training distribution and high label noise .
    Figure S1: Examples of regimes where autofocus (AF) sometimes yielded lower final objectives than without (non-AF). Each row shows snapshots of CbAS in a different experimental regime, from initialization (leftmost panel), to an intermediate iteration (middle panel), to the final iteration (rightmost panel). As in Figure 1, the vertical axis represents both values (for the oracle and ground truth) and probability density values (of the training distribution, , and search distributions, ). Shaded envelopes correspond to standard deviation of the oracles, , with the oracle expectations, , shown as a solid line. Greyscale of training points corresponds to their importance weights used in autofocusing. In the rightmost panels, for easy visualization of the final search models achieved with and without AF, the stars and dotted horizontal lines indicate the ground-truth values corresponding to the points of maximum density.

    Interestingly, when the variances of both the training distribution and label noise were high, autofocusing yielded positive improvement for all trials. In this regime, by encouraging the oracle to fit most accurately to the points with the highest labels, autofocusing resulted in search models with greater mass under the global maximum than the fixed-oracle approach, which was more influenced by the extreme label noise (Figure 0(d)).

    As discussed in section 5.1, in practice it is often the case that 1) practitioners can collect reasonably informative training data for the application of interest, such that some exceptional examples are measured (corresponding to sufficiently large ), and 2) there is always label noise, due to measurement error (corresponding to non-zero ). Thus, we expect many design applications in practice to fall in the intermediate regime where autofocusing tends to yield positive improvements over a fixed-oracle approach (Figure 2, Table 1).

    S4 Designing superconductors with maximal critical temperature

    s4.1 Experimental details

    Figure S2: Training distribution and initial oracle for designing superconductors. Simulated training data were generated from a training distribution, , which was a multivariate Gaussian fit to data points with ground-truth expectations below the percentile. The left panel shows histograms of the ground-truth expectations of these original data points, and the ground-truth expectations of simulated training data. The right panel illustrates the error of an initial oracle used in the experiments, by plotting the ground-truth and predicted labels of test points drawn from the training distribution. The RMSE here was .


    Each of the materials in the superconductivity data from Hamidieh (2018) is represented by a vector of eighty-one real-valued features. We zero-centered and normalized each feature to have unit variance.

    Ground-truth model.

    To construct the model of the ground-truth expectation, , we fit gradient-boosted regression trees using xgboost and the same hyperparameters reported in Hamidieh (2018), which selected them using grid search. The one exception was that we used trees instead of trees, which yielded a hold-out root mean squared error (RMSE) of compared to the hold-out RMSE of reported in Hamidieh (2018). To remove collinear features noted in Hamidieh (2018), we also performed feature selection by thresholding xgboost’s in-built feature weights, which quantifies how many times a feature is used to split the data across all trees. We kept the sixty most important features according to this score, which decreased the hold-out RMSE from when using all the features to . The resulting input space for design was then .

    Training distribution.

    To construct the training distribution, we selected the points from the dataset whose ground-truth expectations were below the percentile (equivalent to degrees Kelvin, compared to the maximum of degrees Kelvin in the full dataset). We used MLE with these points to fit a full-rank multivariate normal, which served as the training distribution, , from which we drew simulated training points, , for each trial. For each we drew one sample, , to obtain a noisy ground-truth label. This training distribution produced simulated training points with a distribution of ground-truth expectations, , reasonably comparable to that of the points from the original dataset (Figure S2, left panel).


    For the oracle, we trained an ensemble of three neural networks to maximize log-likelihood according to the method described in Lakshminarayanan et al. (2017) (without adversarial examples). Each model in the ensemble had the architecture