Recent advances in accelerated discovery through machine learning and statistical inference

Recent advances in accelerated discovery through machine learning and statistical inference

Ryan B. Jadrich McKetta Department of Chemical Engineering, University of Texas at Austin, Austin, Texas 78712, USADepartment of Physics, University of Texas at Austin, Austin, Texas 78712, USAThese authors contributed equally to this review.    Beth A. Lindquist McKetta Department of Chemical Engineering, University of Texas at Austin, Austin, Texas 78712, USADepartment of Physics, University of Texas at Austin, Austin, Texas 78712, USAThese authors contributed equally to this review.    Thomas M. Truskett111Corresponding author. McKetta Department of Chemical Engineering, University of Texas at Austin, Austin, Texas 78712, USADepartment of Physics, University of Texas at Austin, Austin, Texas 78712, USAThese authors contributed equally to this review.

Recent applications of machine learning and statistical inference provide case studies demonstrating how such approaches can accelerate the discovery process in physical chemistry and related fields. Examples discussed in this review include the introduction of automated approaches to systematically improve experimental design, increase the efficiency of computationally expensive molecular simulations, facilitate construction of predictive models for complex biological processes, and discover interparticle potentials that lead to materials which meet specified design goals. A common theme is the synergy between experiment and computation enabled by such approaches.

1 Introduction

Modern computation is integral to research in the physical sciences. In the present “Age of Big Data”, big_data (); big_data_2 (); petabytes_high_energy () there are increasing opportunities for computational research to facilitate scientific discovery via Artificial Intelligence (AI). Historically, much of the academic work in AI was beset by disappointment, with early promises left unfulfilled, culminating in a steep decline in funding for AI research in the 1970s called the “AI winter”. lighthill (); ai_book (); tumultuous_search () In some ways, the overly optimistic predictions of the 1960s have still not come to pass (e.g., the sentient computer HAL 9000 from Arthur C. Clarke’s Space Odyssey). space_odyssey () However, there have also been compelling successes, including the victory of Google’s “AlphaGo” over a professional human opponent, that have been fueled by a resurgence of interest in AI. alpha_go_bbc (); alpha_go_science (); alpha_go_intro () Current work in the field is often guided by a practical data-driven goal: the ability to automatically extract specific insights from (or make predictions based on) large data sets. ai_book (); bayesian_reasoning (); statistical_learning () Thus, while sentient robots may not exist, more subtle achievements have been realized including the introduction of powerful machine learning (ML) and statistical inference (SI) approaches. Such methodologies have the potential to be exploited in the physical sciences in order to 1) extract key insights, 2) make research processes more systematic, and 3) increase research efficiency. This review details recent progress towards this end.

Both ML and SI take as input an array of feature variables and attempt to fit some model to a known “training” set of data. Fitting the model involves “learning” the values of the tunable model parameters via a systematic optimization technique. bayesian_reasoning (); statistical_learning () The distinction between ML and SI that we adopt in this work lies in how the model is used. Typically, ML employs powerful black-box models to accurately predict some unknown quantity of interest that is believed to be dependent on the features. In model fitting, the training data is supplemented with prior observations of the relationship between features and the unknown quantity. For SI, the goal is typically to model the underlying process by which some set of features is generated. The process that generates the observables of interest is assumed to be representable by a probability distribution with parameters that are “learned” or “inferred”.

ML and SI are not wholly new to the physical sciences. For example, widely used regression analysis methods are fundamentally ML methods. More recently though, increasingly advanced ML and SI strategies are being leveraged to solve a variety of challenging problems. In this review, we first describe the use of ML predictions to expedite the iterative experimental design process of synthesis and measurement for maximizing a quantifiable design goal. alloy_design (); another_alloy_design (); phase_boundary_design (); using_uncertainties () By using ML-based methods, the results of previous experiments can be used to predict the most promising new sample to synthesize or new experiment to perform. In another study, ML is leveraged to expedite the computation of quantum mechanical (QM) forces for use in molecular dynamics (MD) simulations. molecular_dynamics_1 (); molecular_dynamics_2 (); force_fields () Prior QM calculations are used to create a predictive ML model relating configurations to QM forces.

Applications of SI in physical chemistry have also emerged to address a variety of challenges. First, we review the recent application of probabilistic statistical mechanical approaches for modeling complex biological phenomena. Application to virus models yields insight into how various residues interact, which could potentially expedite vaccine design. hiv_model (); hiv_model_confirmation (); hepc_model (); error_catastrophe () Similarly, SI methods have been employed to extract intra-cellular interaction networks on the basis of DNA microarray data. gene_interactions () Finally, the review ends with discussion of two new inverse design strategies based on SI for the systematic discovery of materials which either exhibit complex self-assembly behavior clusters_pores_crystals (); crystals (); pores (); many_pores (); clusters () or possess a desired material property design_engines (). Both methods learn the requisite interactions via relatively simple optimization schemes.

2 Acceleration of conventional optimization and simulation schemes

A straightforward application of ML methodology is the prediction of a quantity of interest, , as a function of known input conditions, , called features; here, indicates a set or vector of values of some unspecified size. Generally, is difficult or costly to evaluate. bayesian_reasoning (); statistical_learning (); bayes_thesis () Therefore, accurate predictions are useful to save time and resources. For so-called supervised ML strategies, a predictive model, , is fit via flexible parameters to the prior observations of the correspondence between and , i.e., the training data, . The model allows for prediction of at not included in the training set. Uncertainties associated with the predictions for can be also obtained to yield a probabilistic model for the data, . The uncertainties can either be directly obtained via ML approaches (e.g., Gaussian process models) bayesian_reasoning (); statistical_learning (); bayes_thesis (); bayes_opt_kriging_believer (), or computed via bootstrapping statistical_learning (), i.e., re-sampling the data to generate new model fits that probe the sensitivity of the model to fluctuations in . For either approach, additional sampling both improves the prediction and reduces the uncertainty in .

Supervised ML methods essentially amount to interpolation strategies, each with different underlying assumptions regarding function form, smoothness and possibly the degree of interaction among the features. For example, the popular Gaussian process model requires a “kernel” that governs how correlations between two points in feature space decay, the choice of which influences the smoothness and uncertainty of the predicted function. bayesian_reasoning (); statistical_learning (); bayes_thesis (); bayes_opt_kriging_believer () Another important consideration for ML is the construction of the features. It is oftentimes advantageous to purposefully engineer features to reflect the underlying physics of the problem, instead of requiring the model to learn these relationships via sampling. Feature engineering is non-unique and requires careful consideration of the problem at hand but can ultimately lead to better convergence of the model. bayesian_reasoning (); statistical_learning () Examples of some feature choices will be discussed in the next two subsections, which are organized as follows. In Sect. 2.1, we consider how prediction of the figure of merit (or fitness) associated with a set of tunable conditions can be used to accelerate certain optimization problems. In Sect. 2.2, we discuss the use of ML techniques to approximate QM forces for use in MD simulations.

2.1 Optimization

There are myriad optimization schemes that can be used to maximize a desired property, termed the figure of merit (FOM), by tuning a set of adjustable parameters, bayesian_reasoning (); statistical_learning (); numerical_recipes () Most require the use of derivatives with respect to . However, there are also many cases where derivatives are not easily obtained, e.g., when analytical derivatives are not available (or do not exist) and there are too many adjustable parameters for numerical differentiation to be practical. For such cases, heuristic optimization strategies such as simulated annealing or genetic algorithms have found widespread use. numerical_recipes () In these methods, updates to parameters are governed by a tension between increasing the FOM and promoting a wider exploration of parameter space through stochastic processes. As a result, a large number of FOM evaluations can be required to arrive at a solution, limiting the practicality of such strategies, especially for FOM evaluations that require expensive computations or experiments. To handle such cases, ML techniques can be employed in order to generate parameter updates that are informed by prior FOM evaluations as opposed to stochastic processes that only consider possible changes to the current FOM.

Bayesian optimization (BO) represents one such approach, whereby a probability distribution, , is constructed to model the true underlying FOM function or landscape, , by fitting to prior observations (i.e., above) via the tunable model parameters (). bayes_thesis (); bayes_opt_kriging_believer () provides information on regions in parameter space that are 1) highly likely (determined via prior FOM evaluations) to possess high values and 2) those of low certainty but large positive uncertainty in . The former can be exploited to further refine the best-known parameters, and the latter represent profitable areas for exploration–allowing the optimization to progress to other maxima as needed. As more observations are included, progressively better predictions for are generated.

After construction of , the model must be translated into a prediction for a set of parameters for the next optimization step. bayes_thesis (); bayes_opt_kriging_believer () Because this mapping is non-unique, multiple options exist for this task, each providing a particular balance of exploitation and exploration. One such choice for a recommender function, , is called Expected Improvement


where is often taken as the best FOM evaluation at any prior bayes_opt_kriging_believer () Only regions of uncertainty that are better than the current optimal set of parameters are included, i.e., uncertainty in the negative direction is not penalized. Including the uncertainty associated with the predictions in the recommender function has been shown to enhance the optimization efficacy. using_uncertainties () As the optimization progresses, uncertainty will decrease in regions of parameter space that are sampled, perhaps directing the optimization to explore areas with greater uncertainty.

BO can be generalized to suggest the most promising “batch” of samples, of size bayes_thesis (); bayes_opt_kriging_believer () In computational research, may be the number of parallel processes that can be run, while in experiment, might be governed by the sample auto-feeder size, for instance. One generalization adopted in the work below is called the Kriging believer algorithm, where the model is iteratively updated via predictions for at suggested within the same batch (i.e., “believe” the model). Once actual measurements are taken, the measured FOM values replace the predicted ones in the data set.

One recent study integrated BO into the experimental design of a NiTi-based shape memory alloy (SMA) with minimized thermal hysteresis. alloy_design () SMAs exhibit shape memory and super-elasticity as the result of an underlying high-temperature austenite and low-temperature martensite phase transition. The temperature at which the transition occurs depends on whether the material is being heated or cooled, yielding a hysteresis temperature gap, . Larger values for result in greater performance degradation–motivating the chosen FOM definition: . Optimization targeted an alloy of the form where , , and are doping elements allowed in increments of within the constraints , , and –yielding 797,504 possible combinations. The combinatorial challenge combined with the expensive FOM evaluation (experimental synthesis) called for a Bayesian approach.

Figure 1: Iterative design strategy for minimizing the hysteresis of an alloy. Beginning with a set of training samples to initialize the inference procedure, new samples that hold promise are predicted, synthesized, and the associated hysteresis measured; the new results are then used to augment the training set before beginning the cycle anew. Figure reproduced from alloy_design ().

BO, like all ML approaches, requires defining the features () to be used as input to the ML algorithm. While , and are the tunable experimental parameters, their effect on is mediated by their impact on various atomic properties such as valence electron number. Therefore, the authors represented features via weighted fractions (e.g., ) of the atomic properties listed in the second column of Fig. 1. Each property is a coarse-grained electronic structure measure known to influence the phase transition temperatures and hysteresis. While the ML model might eventually learn these relationships with sufficient sampling, directly encoding them is likely to improve the convergence rate of the model.

The overall iterative design strategy for alloy discovery is summarized in Fig. 1. The initial was composed of 22 samples, which for each, differential scanning calorimetry measured . From these samples, the features were computed and used as input to the probabilistic model, from which a batch of new unexplored alloys with the highest expected improvements were drawn. The FOMs () associated with the new parameter combinations were experimentally determined, and these data points were then used to update the probabilistic model, from which new, more accurate, parameter combinations were extracted in an iterative fashion. After only nine iterations of batch updates, 14 out of the 36 total new alloys posses a , which was the best of the initial 22 training samples. The outcome of the BO was a improvement over the best initial training sample; the alloy with was discovered.

In conclusion, BO is a promising avenue for assisting in experimental (or computational) design. The approach is general and can be adapted to accommodate other scenarios. For example, the same authors recently leveraged BO for the accelerated search of BaTiO-based piezoelectrics to minimize the curvature of a phase boundary. phase_boundary_design () Utilizing various descriptive features, prior phase diagrams, and the predicted functional form (from Landau theory) for the phase boundaries, materials with better piezoelectric properties than any in the initial training set were discovered. Beyond hard materials, the field of soft materials design could benefit from BO as 1) database knowledge is limited, and 2) experiments are often highly customized and expensive. Conveniently, many simplified theories exist for soft materials, soft_matter_book_1 (); soft_matter_book_2 () the predictions of which can be incorporated into the optimization as features to embed prior knowledge and improve model convergence.

2.2 Molecular Simulation

MD simulations require many interparticle force evaluations especially for systems with large numbers of particles, motivating the development of classical force fields. Such models invoke a selected functional form (e.g., a quadratic function for a chemical bond) with flexible model parameters (e.g., the associated spring constant) that are fit to reference data, which might be from either QM calculations or experiments. However, classical force fields are not always appropriate; an archetypal failure is that a harmonic spring cannot model bond-breaking processes. Ideally for such cases, QM data would be used to determine the forces for MD simulations; however, QM calculations that are sufficiently accurate to produce meaningful MD simulations are often too computationally expensive.

A ML model that is able to predict QM forces for a given particle arrangement would significantly reduce the computational expense of MD simulations with QM forces. Such a strategy would be particularly amenable to standard ML approaches due to the high degree of structural similarity between many of the configurations that arise in a given MD simulation. In general, proper sampling in MD studies results in an ensemble of configurations possessing similar structural correlations. Therefore, ML models can be employed to learn the complex relationships between QM forces and the configurations for some initial set of structures that initialize the model (i.e., the training data); the resulting model can then be used to predict QM forces. molecular_dynamics_1 (); molecular_dynamics_2 (); force_fields ()

The above scheme is summarized in Fig. 2. In Fig. 2a, the energy trace of a sample trajectory is shown; green regions indicate steps for which QM calculations are needed. In this case, the model is initialized using the early steps of the trajectory until sufficient sampling occurs such that an accurate predictive model can be constructed, at which point ML is used to predict the QM forces (the peach colored region). Later, a large fluctuation–signaled by the jump in the energy–produces a configuration that is outside of scope of the training data; see Fig. 2b. At this point, explicit QM calculations are performed to augment the training data until this new area is “learned” sufficiently well that the ML predictor can be used again. In the future, large fluctuations that span these two domains can be treated using the augmented ML model. Fig. 2c depicts this procedure as a flowchart.

Figure 2: Scheme to augment QM MD simulations with machine learning. (a) A cartoon of an MD energy trace demonstrating the switching behavior between ML predictions and full QM calculations. (b) Qualitative viewpoint of the configuration space sampled during the MD run and the use of QM calculations to expand the knowledge base on-the-fly as uncertain configurations are encountered. (c) Flowchart for the hybrid QM MD scheme. Figure reproduced with permission from molecular_dynamics_1 ().

The above approach was employed to perform MD simulations of atomic Aluminum (Al) using density functional theory (DFT) to compute the QM forces for the training data. molecular_dynamics_1 (); molecular_dynamics_2 () Intuitively, the features–the known quantities from which the forces are predicted–must depend on the environment of the selected particle, in this case the other Al atoms in the simulation. In order to explicitly encode the physics of the problem, atomic coordinates were mathematically transformed into features that are invariant to any rotation, translation or permutation of identical particles. Additionally, the forces on a given Al atom should most strongly depend on its closest neighbors, with more distant Al atoms contributing less strongly. Therefore, both a damping function and a cutoff distance were included in the features to encode the primarily local effects of neighboring particles on the forces.

Inputting these engineered features into a standard ML model enabled quick and accurate force prediction. The errors in ML forces were found to be within the expected accuracy of DFT itself while reducing the computational expense by multiple orders of magnitude with respect to DFT. Such approaches seem promising and may enable QM MD simulations to reach time and length scales not previously believed to be possible. However, further research is needed to address certain open questions, e.g., how does one detect when the model is outside its regime of predictability? The approach adopted by the authors assumed any feature vector lying outside a hypercube encompassing the training set needed explicit DFT computation. While certainly true, points within the sample region could also warrant explicit computation. One possibility includes modeling not only the expected value but also uncertainty (as done in Bayesian optimization). If the uncertainty exceeds some threshold, an QM computation could be performed instead. Some recent work based on the distance between the new feature set and the closest neighbor in the training data shows promise. force_fields ()

3 Modeling biology with statistical mechanics

In Sect. 2, we reviewed use of ML techniques that prioritize the accuracy of predictions, in general leaving the relationships between the features and the predicted values obscured. In this Section, we consider applications of SI where a primary goal of the modeling is to understand the underlying physics that gives rise to a set of observables, potentially in conjunction with making accurate predictions for a quantity of interest. In the following studies, SI methods are used to construct a model, , for some underlying probability distribution, , where is generally unknown but can be sampled from, e.g., via experimentally obtained statistics.

SI methods have been successfully applied to systems of biological relevance. Here we review two such cases involving 1) viral sequences hiv_model (); hiv_model_confirmation (); hepc_model (); error_catastrophe () and 2) gene expression data gene_interactions (). A common challenge is the large number of features that are characteristic to the problem, namely, where is large. The number of samples required to adequately sample a hyperspace grows exponentially with , colloquially referred to as the “curse of dimensionality”. statistical_learning () In such situations, where it is only possible to sample a small fraction of the possible feature combinations, the data will necessarily be sparse. The problem then is to formulate a model that can take such sparse sampling and make probability predictions for any .

While statistically significant sampling of the full coordinate space, , is challenging in high dimension, it is generally possible to obtain good statistics on quantities that depend on only small subset of coordinates–providing some constraints on . For example, statistically significant empirical calculation of one- and two-point probability distributions, and is often possible. (For example, the radial distribution function, , can be reliably computed from an MD simulation that samples only a minuscule fraction of possible particle arrangements.) Other quantities such as the mean and co-variance may also be reliably estimated. However, enforcing to replicate lower-order correlations is an under-determined problem. A potentially infinite number of can reproduce such lower-order statistics. The challenge is to decide which of the innumerable possibilities for is “best”.

Generally speaking, the best model choice given limited information is the one that is the least predictive (or least “informative”) while reproducing the statistics that are known for jaynes () For example, two of the infinite probability distributions for a six-sided die that are known to yield an average face value of 3.5 are 1) a fair six-sided die (i.e., each face () is equally likely, or for all ), and 2) a die that only turns up a 3 or a 4 in equal proportion ( where ). Intuitively, the for a fair die contains the least information possible–the distribution is maximally “smeared out” across possible states; by contrast, the latter distribution predicts that the face value of the roll will either be a or a . For a more complex constraint though, the “best” may not so intuitive. Therefore, we require a measure quantifying the information content of a distribution .

The maximum entropy principle states that the best model of the many that meet the desired constraints is the one that maximizes the Shannon entropy jaynes (); shannon (); max_cal (); generalized_max_ent (); spin_glasses_and_max_ent ()


where is shorthand for a summation over all possible . In practice, maximum entropy is typically carried out as a two-step protocol. First, Eqn. 2 is analytically maximized subject to the constraints via the technique of Lagrange multipliers. This provides a functional form for the distribution that is parametrized by the undetermined Lagrange multipliers, which are then solved for given the known constraints.

3.1 Viral Fitness Landscapes

Preventative vaccination represents the most realistic way of combating viruses such as Hepatitis C virus (HCV) and human immunodeficiency virus (HIV-I). However, many viruses have high sequence mutability, providing pathways to escape from vaccine-induced immunological pressure. Therefore, effective vaccines might need to account for the entirety of the vast combinatorial “sequence space” available to the virus. Unfortunately, relatively limited amounts of statistical data (from experimental observations of virus sequences) are available in comparison to the high dimensionality of the problem. Therefore, SI strategies based on maximizing the Shannon entropy are appropriate for this task.

The “fitness” associated with a specific sequence is determined by its ability to replicate. While this quantity can be measured by in vitro studies, existing experimental data of this type are sparse. Instead, clinical databases that quantify the frequency of given sequences are used as a proxy for the fitness; these data span a much larger range of sequences. The underlying, physically reasonable, assumption is therefore that “fitter” sequences are more prevalent in the real world hiv_model_confirmation (). The goal of a vaccine is to induce immunological pressure on fit sequences, thus driving the viral population into unfit sequence space where it is minimally damaging to the infected host.

Both HCV and HIV-1 models followed from maximization of the Shannon entropy (Eqn. 2) subject to reproducing the empirical one- and two-point sequence distributions and , respectively, where is the amino acid identity at the location in the considered sequence (). hiv_model (); hiv_model_confirmation (); hepc_model (); error_catastrophe () The model takes on the following statistical mechanical form:


where is the combined set of all one- and two-point dimensionless “energy” functions, and , that need to be determined such that and . Formally, this model is identical to the infinite range Potts spin model in statistical physics. potts_model () It would be possible to match higher-order correlations as well via a multi-body energy expansion; however, the sample size requirements for adequate statistical estimation of correlations increase exponentially with the dimensionality of the correlation quantity. Attempting to exactly reproduce higher-order correlations that have not been adequately sampled will generally result in over-fitting and therefore reduced transferability to data that is not included in the modeling step. Nonetheless, the above models based on one- and two-body correlations constructed for viruses were found to also reproduce three-body correlations, even though these were not explicitly considered for modeling. hepc_model ()

Figure 3: (a) Comparison of in vitro fitness measurements (normalized with respect to the wild-type) for 31 engineered HCV NS5B mutants to maximum entropy model predictions. Overall, qualitatively reasonable agreement is found. (b) The solid black curve is the fictive “heat capacity” as a function of temperature for the HIV P6 protein model, providing a thermodynamic indication of the phase transition. Other heat capacity curves are generated for random spin models not obtained by fitting to the data–indicating the phase transition is a characteristic of the virus, not the modeling approach. (c) Probability for sequences of a particular energy at a specified fictive model temperature in the P6 protein model for HIV. From left to right, in increments of The bi-modal character that emerges at indicates the existence of two populations (high and low energy) that coexist (equal area under the curve)–corresponding to a phase transition in a finite-sized system. Figure adapted from hepc_model (); error_catastrophe ().

The models were validated by a multifaceted analysis which tested the ability of the model to 1) infer sequence fitness, 2) identify viable escape mutations for the virus, and 3) agree with known effective immune responses that target specific regions of the viral sequence. With respect to the first component, one expects measurements of the in vitro fitness, , to be related to the model via if prevalence and fitness are proportional. As shown in Fig. 3a, this relation is found to be reasonably well approximated in HCV–an analogous study showed this to hold for HIV as well. Other more specific tests (such as identifying escape mutations) demonstrated similar success–validating the model and its underlying assumption of fitness and prevalence.

In addition to providing an avenue to deal with the high dimensionality of the problem, the use of a spin model for viral sequences provides the added benefit of interpretability by way of analogy to actual statistical mechanical systems. In modeling HIV-I, a phase transition was observed that might be a consequence of viral “error catastrophe”, a well-known concept where accelerating the mutation rate of the virus to a certain tipping point eventually leads to viral extinction. error_catastrophe () Above the transition, fatal error accumulation leads to an ensemble populated by many high energy (and therefore low fitness) strains. RNA viruses in particular are thought to have evolved to operate in close proximity to the error catastrophe, so as to be maximally mutable–enhancing the ability of viruses to escape the host’s immune response–while remaining effective. rna_error_cat () Purely theoretical models have been developed that display the error catastrophe; however, such models necessarily invoke uncontrolled assumptions about viral replication. error_catastrophe_model () Empirical modeling of viral fitness as described above, on the other hand, is based upon clinical data that implicitly accounts for factors related to viral fitness and replication.

By defining a fictive temperature (), which corresponds to a re-scaling of the energy via in Eqn. 3, a corresponding “heat capacity” was computed for the HIV-I fitness model. The sharp spike shown in Fig. 3b is suggestive of an underlying phase transition. This phase transition was interpreted as a signature of the error catastrophe because it signifies a change in the population preference for low versus high fitness sequences, as can be observed in Fig. 3c. Specifically, below the transition temperature (), the model virus exists as an ensemble of relatively few high fitness (i.e., low energy) strains, with the distribution narrowing and shifting to lowering energies as decreases. Conversely, the model predicts a generally broader and much higher energy distribution above . However, at , there is a co-existence between high and low fitness strains, as evidenced by the bimodality in the black curve in Fig. 3c. Of course, this co-existence between two “states” is also a well-known characteristic of a first-order phase change in systems that obey equilibrium statistical mechanics.

This research provides the first empirical HIV model to corroborate the limited experimental reports supporting the existence of an accessible error catastrophe. It has been proposed that altering the model via tuning corresponds to varying the mutation rate: 1) corresponds to random copying, and 2) corresponds to perfect, high-fidelity replication–a variable which could be exploited in the search for a cure. Interestingly, such effective temperatures may be reachable by mutation rate enhancing drug therapies, and some clinical trials for an HIV mutagen have already begun. error_catastrophe_exp_1 (); error_catastrophe_exp_2 (); error_catastrophe_exp_3 ()

3.2 Maximum entropy inference for complex systems

In the studies above, it might seem surprising that inclusion of only one- and two-body correlations is sufficient to accurately model the many-body relationship between sequence and viral fitness. However, recent work indicates that the complexity inherent to biological systems might actually contribute to the successes of pairwise models in approximating their behavior. pairwise_sufficiency () A prior study of simplified Ising p-spin models p_spin () indicates that the sufficiency of a pairwise model is tied to the entropy of the distribution being approximated. p-spin models describe a set of spins with possible values as follows


where is shorthand for the set of all p-body spin couplings, .

An ensemble of and models were randomly generated–possessing a range of entropies–by varying the number of spins (), the number of non-zero p-body couplings (), and the values of the couplings (). These higher-order models were mapped onto pairwise spin models (), subject to the constraint that the two-body spin probabilities of the pairwise model matched the reduced two-body distributions obtained from the higher-order probability distribution. A schematic representation of this mapping for is shown in Fig. 4a: the top network depicts a three-body spin model with four spins (circles), where all of the 3-body couplings (squares) are chosen to be non-zero. On the bottom is the corresponding pairwise spin model, where the 2-body couplings are represented as hexagons.

Figure 4: (a) Schematic depicting the mapping of a 3-body spin model onto a pairwise spin model for four spins (the circles). The 3-body interactions are represented with squares, and the pairwise interactions are shown as hexagons. Due to the small number of spins, there are more pairwise interactions than higher-order couplings; however, practically useful examples will typically have more spins and therefore possess fewer pairwise interactions. (b-c) The dependence of the Kullback-Leibler divergence on the entropy () for various model p-spin systems (using a base two logarithm, as opposed to the more traditional natural log in Eqn. 2 and Eqn. 5). b) Randomly generated 3-body interactions networks composed of spins are mapped onto one- and two-body reduced models. c) Randomly generated 4-body interactions networks composed of spins are mapped onto a pairwise model. The data are segregated according to a measure of the network “interaction strength” , which is proportional to the product of the standard deviation of the randomly generated spin couplings, , and the (normalized) number of non-zero p-body couplings i.e., . Figure adapted from pairwise_sufficiency ().

In order to evaluate how well the reduced distributions replicated the targets, a measure of similarity between two distributions is required. Various metrics exist for this purpose, many of which have seen renewed interest for their utility in rigorous force-field optimization and systematic coarse-graining. statistical_distances (); relative_entropy (); noid_perspective () For the p-spin study, the authors adopted the popular Kullback-Leibler divergence


which is a measure of the degree to which overlaps with relative_entropy () The KL-divergence is also known as the relative entropy in the field of bio-molecular coarse-graining. relative_entropy ()

Adopting the higher-order model as , Fig. 4b demonstrates that there are two regimes where the higher-order distributions are well-represented by pairwise models (i.e., ): when the entropy of the target is either very low or very high. For high entropy systems, this agreement is trivial: if there is relatively little order, there are not particularly stringent requirements on the (necessarily weak) couplings that can generate such a system. Even one-body, or independent, models work well when the entropy is sufficiently high, as can be seen in the right hand side of Fig. 4b. However, more surprising is the observation that very ordered systems (i.e., ) are also well-represented by pairwise models. Furthermore, pairwise sufficiency was also found to be favored when the higher-order networks are densely interacting–as quantified by the parameter . In Fig. 4c, the ensemble of spin models were first sorted by of the target network and then plotted separately for different ranges in . Fig. 4c shows a notable decrease in as increases.

Taken together, pairwise models are particularly well suited to represent higher-order networks when the latter has low entropy and is densely interacting. Low entropy systems necessarily possess correlations among constituent particles, spins, etc.; the lowest entropy spin system occurs when all spins move coherently (all up or all down), for instance. While such an arrangement may be obtained via many higher-order models, sufficiently strong positive pair interactions between all spins will achieve the same structure. Similarly, finite-sized correlated spin groups (i.e., clusters) can be obtained by employing strong pair interactions between the spins comprising each spin cluster while setting all other pair interactions to significantly smaller values. Interestingly, prior work on random constraint satisfaction anticipates the presence of such large correlated groups–amenable to representation via pair interactions–when the underlying distribution is densely interacting. rcs_1 (); rcs_2 ()

It has been suggested that biological systems can be regarded as evolutionarily derived solutions to complex constraint satisfaction problems spins_evolution (); pairwise_sufficiency ()–ideal candidates for modeling with pairwise statistical models, according to the above analysis. Therefore, pairwise sufficiency may actually be characteristic to many biological systems–not just for the viral sequences reviewed earlier. Indeed, other successes in the modeling of biological processes with pairwise maximum entropy have been demonstrated. spin_glasses_and_max_ent (); max_ent_antibody (); max_ent_ecoli (); max_ent_flocks (); bio_near_criticality () For example, such methods have been used to construct a model for gene expression levels under fluctuating cellular metabolic conditions. gene_interactions () DNA micro-arrays can measure thousands of gene expression levels at a time, resulting in a high-dimensional data set. Similar to the viral sequences above, pairwise entropy maximization provides an avenue to both circumvent the curse of dimensionality and extract pair interactions between genes.

This approach was applied to gene expression data of the well-studied Saccharomyces cerevisiae (i.e., Baker’s yeast) under conditions which facilitate energy metabolic oscillations. gene_oscillations () The resultant fluctuations in transcription levels yield an empirical ensemble of gene expression data. Micro-array data was encoded as an array of variables, , where , is one of continuous, scalar, gene expression levels. Constraints imposed on the maximum entropy model include the empirically observed one and two-body gene expression correlations, and , respectively; this yields a model analogous to Eqn. 4, where both and contributions are present and the spins are continuous.

The benefit of using SI for gene expression data as opposed to a more common correlation-based analysis (e.g., clustering) gene_clustering () is the construction of a network that quantifies the direct interactions between genes, possibly uncovering the driving force behind the observed correlations. Using correlation data alone only indicates which gene levels change together, but not what elements are driving those fluctuations. With the maximum entropy model though, it is easy to identify important governing genes, i.e., the nodes of high connectivity. For example, Fig. 5 has seven nodes with six or more connections, each of which corresponds to an important cellular protein for nutrient signaling.

Furthermore, the model can uncover relationships between genes that are not part of the same cellular process. This can be seen visually in Fig. 5 by the many interactions between genes of different colors, where color indicates the cell process associated with a given gene. Therefore, while the protein expression patterns of S. cerevisiae are relatively well understood and characterized, this need not be the case for the maximum entropy model to discover gene interactions and “hubs” that govern cellular activity and system dynamics within the cell.

Figure 5: Strongest 110 pair interactions (network connections) connecting genes (nodes) in the maximum entropy model. Each of the genes is labeled as well as color coded to indicate an approximate cellular function role. Solid-blue and dashed-red lines indicate positive and negative couplings, respectively, and the seven most-connected nodes (called “hubs”) are circled. Figure reproduced from gene_interactions ().

4 Statistical mechanical inverse design of structure and properties

The discovery of materials possessing a desired attribute is a long-standing design challenge that has been facilitated by computer simulation studies. Forward design–searching through parameter space to find favorable combinations that yield materials with some chosen characteristic–has benefited from recent advances in computational power. janus (); poly_packing (); empty_liquids (); lamellae (); patchy_micelles () However, combinatorial space increases exponentially with dimensionality; therefore, such approaches can still be hampered by the cost associated with exploration of vast swathes of parameter space as well as potential couplings between tunable parameters. Inverse design (ID) methodologies–those characterized by an optimization framework typically used to iteratively update the parameters in a systematic fashion–represent an alternative approach. Inverse methods of statistical mechanics have facilitated the discovery of conditions (particle interactions, external fields, etc.) that lead to assembly of novel architectures. jamming_design_3 (); nucleation_id (); template_directed (); st_inv_des_review (); swarm_id ()

One challenge to ID is to delineate an effective scheme to update the parameters; SI techniques can be leveraged to construct an optimization framework for such purposes. For instance, the strategy outlined in Sect. 3 for modelling an unknown probability distribution function can be adapted for application to complex materials design problems by “mapping” a contrived probability distribution ()–which has been intentionally imbued with some desired property–onto a statistical mechanical model. Standard SI approaches can then be used to discover the optimal parameters needed to reproduce the property present in .

SI, therefore, can provide a prescription for performing updates to a probability distribution that is constrained by a statistical mechanical model. In the context of equilibrium molecular simulation, the clear choice for the probability of observing a given configuration X is the Boltzmann distribution


The probability for the configuration X, where X can represent continuous particle coordinates, discrete spins, etc., can be tuned via the set of variables, which parametrize the configuration energy . The focus of the remainder of this section is primarily on two recent advances in novel materials discovery, each pertaining to complementary aspects of materials design: structure and properties. Both strategies optimize a probability distribution toward the design goal, where the optimization is guided by SI.

4.1 Self-Assembly

Top down fabrication of materials with desired nanoscale structural characteristics remains a formidable technological challenge. tom_avni_perspective () Spontaneous self-assembly of constituent nanoscale particles provides an alternative bottom up approach for realizing materials with such designer morphologies; however, careful tuning of inter-particle interactions is required. granick (); patchy_review (); janus (); poly_packing (); empty_liquids (); lamellae (); patchy_micelles () A recent series of studies outlined a statistical mechanical ID strategy for the discovery of such interactions.

For a particle-based system, the relevant coordinate () is the set of D-dimensional Cartesian particle positions, . Prior to performing SI, a contrived target probability distribution over configuration space, , that yields an ensemble of “desirable” configurations must be specified. clusters_pores_crystals (); crystals (); pores (); many_pores (); clusters () Simulations constrained by either many-body interactions or by use of multiple components that enforce the desired structure have been successfully employed for this purpose. Then the KL-divergence (Eqn. 5) between and is minimized (where plays the role of from Sect. 3). In Eqn. 5, the only term that depends on the tunable parameters is the second “overlap” contribution, .

Maximizing with respect to has an intuitive probabilistic interpretation under fairly nonrestrictive assumptions. In particular, it is equivalent to maximizing the probability for the parametric model, , to sample an independently identically distributed set of configurations drawn from in the large sample limit ()–formally known as the maximum-likelihood approach of SI. bayesian_reasoning (); statistical_learning ()

Direct maximization of is intractable due to the presence of in Eqn. 6; however, iterative updates to are easy to compute via natural gradient ascent. clusters_pores_crystals (); crystals () Under the constraint of isotropic pair interactions, i.e., where , gradient ascent optimization in the canonical ensemble corresponds to


where is the radial distribution function, is the “learning” rate which is empirically set to guarantee stability of the optimization, and the definition of from Eqn. 6 has been used. We note the analogy between the above approach to that outlined in Sect. 3; in both, pair correlations (here, ) are used as a proxy for the full probability distribution. In a sense, MD configurations represent a familiar example of sparsely sampled data: there are infinite possibilities for particle arrangements, and only some finite subset are sampled in an MD simulation.

Figure 6: Representative configurations obtained with the ID pair potentials for (a) clusters, (b) pores, and (c) the truncated hexagonal lattice.

The above approach has been successfully employed to engineer pair interactions that promote formation of clustered fluids, porous mesophase assemblies, and crystalline lattices; see Fig. 6. The strategy is sufficiently general to handle all of these disparate architectures, with the primary difference being the construction of appropriate target ensembles. In all cases, standard simulation tools were employed with constraints chosen to enforce desired structural motifs. For cluster fluids, a many-body constraint upon the radius of gyration of each cluster was employed in conjunction with inter-cluster repulsive interactions that effectively enforced a minimum distance between particles in different clusters. clusters (); clusters_pores_crystals () In the case of a porous mesophase, a binary mixture of size-asymmetric hard-core-like particles was constructed, where the large particles created spherical zones of exclusion with respect to the smaller particles. pores (); clusters_pores_crystals () Finally, mutually non-interacting particles harmonically tethered to the desired lattice sites were used to emulate a crystalline lattice. crystals (); clusters_pores_crystals ()

Though some of the above architectures had previously been observed in MD simulations with pairwise interactions, particularly for clustered phases, postulated_pore_phases_sear (); simulated_pore_phases (); generalized_cluster (); bolhuis_clusters () no systematic framework existed for controlling the size of mesoscale structural features (i.e., clusters or the pores). By encoding the desired cluster or pore size in the target ensembles, interparticle interactions were discovered that reflected the prescribed mesophase dimension, allowing for the diameter of pores or the number of particles in a cluster to be tuned by modulating .

As can be visually gleaned from Fig. 6, the desired motifs (clusters, pores, and crystals) all require highly structured particle arrangements, intuitively indicating a target ensemble with relatively low entropy. As discussed in Sect. 3.2, such low entropy ensembles are generally amenable to pairwise models, and the successes outlined above seem to support this finding. pairwise_sufficiency () When there is flexibility in the creation of the target ensemble, a reasonable design principle might be to choose the lowest entropy ensemble that is practical. The systematic implementation of such a goal is an open question for future work.

4.2 Material Properties

While Sect. 4.1 outlines a robust methodology for designing structure, the relationship between structure and material properties is a complicated many-body physics problem. Therefore, a systematic, statistical mechanical framework for the direct optimization of targeted material properties is complementary to the structure-based ID approach outlined above.

One such ID scheme leverages the probabilistic micro-state configuration information afforded by a statistical mechanics driven approach. The key requirement of the method is a link between configuration, R, and a configuration-dependent FOM, , which quantifies the user’s goal. design_engines () For targeting a material property, could be the material property itself or a related metric. Derivation of a heuristic optimization strategy starts with an idealized case where is infinitely flexible and can be additively updated by some at each individual R. While the choice of is non-unique, it is constrained by the requirement that normality of the probability is preserved, , and that it generally increases (decreases) the probability for R that has improved (decreased) fitness. One simple choice that guarantees both is


Constraining the probability distribution to conform to a realistic statistical mechanical model necessarily limits the possibilities for ; i.e., the probability of observing a single configuration cannot generally be tuned independently of the probabilities associated with other configurations. Thus, exactly realizing Eqn. 8 is not generally feasible. Instead, one seeks the best approximation to the ideal update afforded by a change in parameters () via


by minimizing the average (over ) sum of square differences between the ideal and realizable updates. Similar to Sect. 4.1, the updates to the parameters can be computed iteratively.

Figure 7: ID of a polymer chain that folds into an octahedron. (a) Representative polymer configurations at specified optimization step numbers. (b) Convergence of the optimization as measured by the mean percent deviation between the average polymer radius of gyration () and the desired octahedron value (). (c) Evolution of the energetic coupling constants during the optimization. (inset) Schematic representation of optimized interaction network, where thicker lines imply stronger couplings. Figure reproduced from design_engines ().

The above methodology has been shown to be successful in a variety of contexts, including for a square-lattice Ising model where interaction parameters were tuned to maximize magnetization, optimization of a trap for a thermalized particle constrained to a 2D substrate, self-assembly of block copolymers on a chemically patterned substrate, and discovery of pairwise interactions that result in a six-bead chain spontaneously folding into an octahedron. For the last case, each bead, a coarse-grained representation of a polymer, was modeled as a hard sphere connected by fixed bond lengths between nearest neighbors. Instead of optimizing for structure directly as in Sect. 4.1, the radius of gyration [] of the chain was used to quantify the polymer shape. The qualitative progress of the optimization, progressing from an extended chain to a compact object, can be seen in Fig. 7a, where the index indicates the optimization step. More quantitatively, the metric (percent deviation from target, essentially an inverse fitness) decreased monotonically over the optimization; see Fig. 7b, where is the mean radius of gyration. After roughly 200 iterations, the mean deviation from an octahedral was only around 1%; it is visually apparent in Fig. 7a that the final steps in the optimization resulted in an octahedron.

The evolution of the couplings () over the course of the optimization are shown in Fig. 7c. The inset schematic shown in Fig. 7c summarizes the optimized results as the interaction network where thicker lines indicate stronger couplings. Though symmetry is not explicitly enforced, the mirror symmetry present in both the bead model and the octahedron is manifest (to a good approximation) in the interactions, i.e., , etc. Furthermore, the resulting interactions are relatively intuitive; for instance, pairs of beads that are positioned co-linearly with the center of the octahedron, and therefore are farther apart, have the weakest couplings ().

For the above proof-of-concept example, the relationship between structure and property () is rather trivial, but this need not be the case. In principle, the method can be straightforwardly applied for any material property which uniquely depends on particle coordinates. For example, other structural metrics like the fractal dimension and average spacing between particles are feasible. More generally, quantities that use particle coordinates in intermediate theoretical calculations, such as electron hopping conductivity, conductivity () can be optimized.

5 Conclusions and outlook

Techniques of ML and SI are useful in a wide variety of physical chemistry applications in both experimental and computational contexts–a sampling of which has been provided in this review. For example, Bayesian optimization is straightforward to incorporate into experimental design loops to systematize the selection of new samples to prepare and measure. On the computational side, ML tools have been incorporated into MD simulations to accelerate QM force calculations, though other expensive computations for simulation could potentially be facilitated by ML as well. Also reviewed were applications of SI for modeling complex systems, with case studies taken from 1) biology in the form of viral sequences and gene expression data and 2) materials design of both structure and properties.

It is relatively early on in the adoption of ML and SI tools in the physical sciences. Undoubtedly many new applications remain to be discovered. However, one broad implication of further integrating ML and SI into research is the promise of increased opportunities for cooperation between experimental and computational research. For example, the viral sequence study used real clinical data to create an empirical computational model that may be translated into rational, real world, vaccine design. Similarly, Bayesian optimization can leverage theoretical frameworks to construct features to be input into a numerical optimization procedure that guides experimental design. In the future, perhaps the primary utility of ML and SI methodologies will be the erosion of the distinctions between experiment, computation, and theory, ultimately facilitating both collaboration and discovery.


We are grateful for comments and suggestions provided by Andrew Ferguson on a working draft of this review. Work from the authors presented here was partially supported by the National Science Foundation (1247945) and the Welch Foundation (F-1696). We acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources.


  • (1) P. Campbell, “Editorial on special issue on big data: Community cleverness required,” Nature, vol. 455, no. 7209, p. 1, 2008.
  • (2) K. Cukier, Data, data everywhere: A special report on managing information. Economist Newspaper, 2010.
  • (3) G. Brumfiel, “Down the petabyte highway,” Nature, vol. 469, no. 20, pp. 282–283, 2011.
  • (4) J. Lighthill, “Artificial intelligence: A general survey,” Science Research Council, 1973.
  • (5) S. J. Russell and P. Norvig, Artificial intelligence: A modern approach. Prentice Hall, 3rd ed., 2009.
  • (6) D. Cervier, AI: The Tumultuous Search for Artificial Intelligence. Basic Books, 1993.
  • (7) A. C. Clarke, 2001: A space odyssey. New American Library, 1968.
  • (8) “Artificial intelligence: Google’s AlphaGo beats Go master Lee Se-dol,” BBC, 2016.
  • (9) D. Mackenzie, “Update: Why this week’s man-versus-machine Go match doesn’t matter (and what does),” Science, 2016.
  • (10) P. S. Churchland and T. J. Sejnowski, The computational brain: 25th anniversary edition. MIT press, 2016.
  • (11) D. Barber, Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
  • (12) J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning. Springer Series in Statistics, Springer-Verlag New York, 2nd ed., 2009.
  • (13) D. Xue, P. V. Balachandran, J. Hogden, J. Theiler, D. Xue, and T. Lookman, “Accelerated search for materials with targeted properties by adaptive design,” Nat. Commun., vol. 7, p. 11241, 2016.
  • (14) D. Xue, R. Yuan, Y. Zhou, D. Xue, T. Lookman, G. Zhang, X. Ding, and J. Sun, “Design of high temperature Ti-Pd-Cr shape memory alloys with small thermal hysteresis,” Sci. Rep., vol. 6, p. 28244, 2016.
  • (15) D. Xue, P. V. Balachandran, R. Yuan, T. Hu, X. Qian, E. R. Dougherty, and T. Lookman, “Accelerated search for BaTiO-based piezoelectrics with vertical morphotropic phase boundary using bayesian learning,” Proc. Natl. Acad. Sci. USA, vol. 113, no. 47, pp. 13301–13306, 2016.
  • (16) P. V. Balachandran, D. Xue, J. Theiler, J. Hogden, and T. Lookman, “Adaptive strategies for materials design using uncertainties,” Sci. Rep., vol. 6, p. 19660, 2016.
  • (17) V. Botu and R. Ramprasad, “Adaptive machine learning framework to accelerate ab initio molecular dynamics,” Int. J. Quantum Chem., vol. 115, no. 16, pp. 1074–1083, 2015.
  • (18) V. Botu and R. Ramprasad, “Learning scheme to predict atomic forces and accelerate materials simulations,” Phys. Rev. B, vol. 92, p. 094306, 2015.
  • (19) V. Botu, R. Batra, J. Chapman, and R. Ramprasad, “Machine learning force fields: Construction, validation, and outlook,” J. Phys. Chem. C, vol. 121, no. 1, pp. 511–522, 2017.
  • (20) J. K. Mann, J. P. Barton, A. L. Ferguson, S. Omarjee, B. D. Walker, A. Chakraborty, and T. Ndung’u, “The fitness landscape of HIV-1 Gag: Advanced modeling approaches and validation of model predictions by in vitro testing,” PLoS Comput. Biol., vol. 10, no. 8, pp. 1–11, 2014.
  • (21) K. Shekhar, C. F. Ruberman, A. L. Ferguson, J. P. Barton, M. Kardar, and A. K. Chakraborty, “Spin models inferred from patient-derived viral sequence data faithfully describe HIV fitness landscapes,” Phys. Rev. E: Stat. Nonlin. Soft Matter Phys., vol. 88, p. 062705, 2013.
  • (22) G. R. Hart and A. L. Ferguson, “Empirical fitness models for hepatitis C virus immunogen design,” Phys. Biol., vol. 12, no. 6, p. 066006, 2015.
  • (23) G. R. Hart and A. L. Ferguson, “Error catastrophe and phase transition in the empirical fitness landscape of HIV,” Phys. Rev. E: Stat. Nonlin. Soft Matter Phys., vol. 91, p. 032705, 2015.
  • (24) T. R. Lezon, J. R. Banavar, M. Cieplak, A. Maritan, and N. V. Fedoroff, “Using the principle of entropy maximization to infer genetic interaction networks from gene expression patterns,” Proc. Natl. Acad. Sci. USA, vol. 103, no. 50, pp. 19033–19038, 2006.
  • (25) R. B. Jadrich, B. A. Lindquist, and T. M. Truskett, “Probabilistic inverse design for self-assembling materials,” J. Chem. Phys., vol. 146, no. 18, p. 184103, 2017.
  • (26) B. A. Lindquist, R. B. Jadrich, and T. M. Truskett, “Communication: Inverse design for self-assembly via on-the-fly optimization,” J. Chem. Phys., vol. 145, no. 11, p. 111101, 2016.
  • (27) B. A. Lindquist, R. B. Jadrich, and T. M. Truskett, “Assembly of nothing: equilibrium fluids with designed structured porosity,” Soft Matter, vol. 12, pp. 2663–2667, 2016.
  • (28) B. A. Lindquist, S. Dutta, R. B. Jadrich, D. J. Milliron, and T. M. Truskett, “Interactions and design rules for assembly of porous colloidal mesophases,” Soft Matter, vol. 13, pp. 1335–1343, 2017.
  • (29) R. B. Jadrich, J. A. Bollinger, B. A. Lindquist, and T. M. Truskett, “Equilibrium cluster fluids: pair interactions via inverse design,” Soft Matter, vol. 11, pp. 9342–9354, 2015.
  • (30) M. Z. Miskin, G. Khaira, J. J. de Pablo, and H. M. Jaeger, “Turning statistical physics models into materials design engines,” Proc. Natl. Acad. Sci. USA, vol. 113, no. 1, pp. 34–39, 2016.
  • (31) E. Brochu, Interactive Bayesian Optimization: Learning User Preferences for Graphics and Animation. PhD thesis, The University of British Columbia, 2010.
  • (32) Y. Tenne and C.-K. Goh (Eds.), Computational intelligence in expensive optimization problems, vol. 2. Springer-Verlag Berlin Heidelberg, 2010.
  • (33) W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical recipes in C. Cambridge University Press, 3rd ed., 2007.
  • (34) I. W. Hamley, Introduction to soft matter: Synthetic and biological self-assembling materials. John Wiley & Sons, 2007.
  • (35) T. A. Witten and P. A. Pincus, Structured fluids: polymers, colloids, surfactants. Oxford University Press, 2004.
  • (36) E. T. Jaynes, “Information theory and statistical mechanics,” Phys. Rev., vol. 106, pp. 620–630, 1957.
  • (37) C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
  • (38) S. Pressé, K. Ghosh, J. Lee, and K. A. Dill, “Principles of maximum entropy and maximum caliber in statistical physics,” Rev. Mod. Phys., vol. 85, pp. 1115–1141, 2013.
  • (39) R. Hanel, S. Thurner, and M. Gell-Mann, “How multiplicity determines entropy and the derivation of the maximum entropy principle for complex systems,” Proc. Natl. Acad. Sci. USA, vol. 111, no. 19, pp. 6905–6910, 2014.
  • (40) M. Castellana and W. Bialek, “Inverse spin glass and related maximum entropy problems,” Phys. Rev. Lett., vol. 113, p. 117204, 2014.
  • (41) F. Y. Wu, “The Potts model,” Rev. Mod. Phys., vol. 54, pp. 235–268, 1982.
  • (42) S. Crotty, C. E. Cameron, and R. Andino, “RNA virus error catastrophe: Direct molecular test by using ribavirin,” Proc. Natl. Acad. Sci. USA, vol. 98, no. 12, pp. 6895–6900, 2001.
  • (43) J. Summers and S. Litwin, “Examining the theory of error catastrophe,” J. Virol., vol. 80, no. 1, pp. 20–26, 2006.
  • (44) R. A. Smith, L. A. Loeb, and B. D. Preston, “Lethal mutagenesis of HIV,” Virus Res., vol. 107, no. 2, pp. 215 – 228, 2005.
  • (45) M. J. Dapp, C. L. Clouser, S. Patterson, and L. M. Mansky, “5-Azacytidine can induce lethal mutagenesis in human immunodeficiency virus type 1,” J. Virol., vol. 83, no. 22, pp. 11950–11958, 2009.
  • (46) J. I. Mullins, L. Heath, J. P. Hughes, J. Kicha, S. Styrchak, K. G. Wong, U. Rao, A. Hansen, K. S. Harris, J.-P. Laurent, D. Li, J. H. Simpson, J. M. Essigmann, L. A. Loeb, and J. Parkins, “Mutation of HIV-1 genomes in a clinical population treated with the mutagenic nucleoside KP1461,” PLOS ONE, vol. 6, no. 1, pp. 1–10, 2011.
  • (47) L. Merchan and I. Nemenman, “On the sufficiency of pairwise interactions in maximum entropy models of networks,” J. Stat. Phys., vol. 162, no. 5, pp. 1294–1308, 2016.
  • (48) T. R. Kirkpatrick and D. Thirumalai, “p-spin-interactions spin-glass models: Connections with the structural glass problem,” Phys. Rev. B, vol. 36, pp. 5388–5397, 1987.
  • (49) L. Vlcek and A. A. Chialvo, “Rigorous force field optimization principles based on statistical distance minimization,” J. Chem. Phys., vol. 143, no. 14, p. 144110, 2015.
  • (50) M. S. Shell, Coarse-Graining with the Relative Entropy, pp. 395–441. in Advances in Chemical Physics (eds S. A. Rice and A. R. Dinner), John Wiley & Sons, Inc., 2016.
  • (51) W. G. Noid, “Perspective: Coarse-grained models for biomolecular systems,” J. Chem. Phys., vol. 139, no. 9, p. 090901, 2013.
  • (52) F. Krz̧akała, A. Montanari, F. Ricci-Tersenghi, G. Semerjian, and Zdeborová, “Gibbs states and the set of solutions of random constraint satisfaction problems,” Proc. Natl. Acad. Sci. USA, vol. 104, no. 25, pp. 10318–10323, 2007.
  • (53) D. Achlioptas, “Solution clustering in random satisfiability,” Eur. Phys. J. B, vol. 64, no. 3, p. 395, 2008.
  • (54) D. S. Rokhsar, P. W. Anderson, and D. L. Stein, “Self-organization in prebiological systems: Simulations of a model for the origin of genetic information,” J. Mol. Evol., vol. 23, no. 2, pp. 119–126, 1986.
  • (55) T. Mora, A. M. Walczak, W. Bialek, and C. G. Callan, “Maximum entropy models for antibody diversity,” Proc. Natl. Acad. Sci. USA, vol. 107, no. 12, pp. 5405–5410, 2010.
  • (56) J. Otwinowski and I. Nemenman, “Genotype to phenotype mapping and the fitness landscape of the E. coli lac promoter,” PLOS ONE, vol. 8, no. 5, pp. 1–10, 2013.
  • (57) W. Bialek, A. Cavagna, I. Giardina, T. Mora, E. Silvestri, M. Viale, and A. M. Walczak, “Statistical mechanics for natural flocks of birds,” Proc. Natl. Acad. Sci. USA, vol. 109, no. 13, pp. 4786–4791, 2012.
  • (58) T. Mora and W. Bialek, “Are biological systems poised at criticality?,” J. Stat. Phys., vol. 144, no. 2, pp. 268–302, 2011.
  • (59) R. R. Klevecz, J. Bolen, G. Forrest, and D. B. Murray, “A genomewide oscillation in transcription gates DNA replication and cell cycle,” Proc. Natl. Acad. Sci. USA, vol. 101, no. 5, pp. 1200–1205, 2004.
  • (60) P. D’haeseleer, “How does gene expression clustering work?,” Nat. Biotechnol., vol. 23, no. 12, p. 1499, 2005.
  • (61) F. Sciortino, A. Giacometti, and G. Pastore, “Phase diagram of janus particles,” Phys. Rev. Lett., vol. 103, p. 237801, 2009.
  • (62) E. R. Chen, D. Klotsa, M. Engel, P. F. Damasceno, and S. C. Glotzer, “Complexity in surfaces of densest packings for families of polyhedra,” Phys. Rev. X, vol. 4, p. 011024, 2014.
  • (63) E. Bianchi, J. Largo, P. Tartaglia, E. Zaccarelli, and F. Sciortino, “Phase diagram of patchy colloids: Towards empty liquids,” Phys. Rev. Lett., vol. 97, p. 168301, 2006.
  • (64) Z. Preisler, T. Vissers, F. Smallenburg, G. Munaò, and F. Sciortino, “Phase diagram of one-patch colloids forming tubes and lamellae,” J. Phys. Chem. B, vol. 117, no. 32, pp. 9540–9547, 2013.
  • (65) D. J. Kraft, R. Ni, F. Smallenburg, M. Hermes, K. Yoon, D. A. Weitz, A. van Blaaderen, J. Groenewold, M. Dijkstra, and W. K. Kegel, “Surface roughness directed self-assembly of patchy particles into colloidal micelles,” Proc. Natl. Acad. Sci. USA, vol. 109, no. 27, pp. 10787–10792, 2012.
  • (66) L. K. Roth and H. M. Jaeger, “Optimizing packing fraction in granular media composed of overlapping spheres,” Soft Matter, vol. 12, pp. 1107–1115, 2016.
  • (67) W. L. Miller and A. Cacciuto, “Exploiting classical nucleation theory for reverse self-assembly,” J. Chem. Phys., vol. 133, no. 23, p. 234108, 2010.
  • (68) A. F. Hannon, Y. Ding, W. Bai, C. A. Ross, and A. Alexander-Katz, “Optimizing topographical templates for directed self-assembly of block copolymers via inverse design simulations,” Nano Lett., vol. 14, no. 1, pp. 318–325, 2014.
  • (69) S. Torquato, “Inverse optimization techniques for targeted self-assembly,” Soft Matter, vol. 5, pp. 1157–1173, 2009.
  • (70) S. P. Paradiso, K. T. Delaney, and G. H. Fredrickson, “Swarm intelligence platform for multiblock polymer inverse formulation design,” ACS Macro Lett., vol. 5, no. 8, pp. 972–976, 2016.
  • (71) A. Jain, J. A. Bollinger, and T. M. Truskett, “Inverse methods for material design,” AIChE Journal, vol. 60, no. 8, pp. 2732–2740, 2014.
  • (72) Q. Chen, S. C. Bae, and G. Steve, “Directed self-assembly of a colloidal kagome lattice,” Nature, vol. 469, pp. 381–384, 2011.
  • (73) E. Bianchi, R. Blaak, and C. N. Likos, “Patchy colloids: state of the art and perspectives,” Phys. Chem. Chem. Phys., vol. 13, pp. 6397–6410, 2011.
  • (74) R. P. Sear and W. M. Gelbart, “Microphase separation versus the vapor-liquid transition in systems of spherical particles,” J. Chem. Phys., vol. 110, no. 9, pp. 4582–4588, 1999.
  • (75) A. J. Archer and N. B. Wilding, “Phase behavior of a fluid with competing attractive and repulsive interactions,” Phys. Rev. E: Stat., Nonlinear, Soft Matter Phys., vol. 76, p. 031501, 2007.
  • (76) P. D. Godfrin, N. E. Valadez-Pérez, R. Castañeda-Priego, N. J. Wagner, and Y. Liu, “Generalized phase behavior of cluster formation in colloidal dispersions with competing interactions,” Soft Matter, vol. 10, pp. 5061–5071, 2014.
  • (77) E. Mani, W. Lechner, W. K. Kegel, and P. G. Bolhuis, “Equilibrium and non-equilibrium cluster phases in colloids with competing interactions,” Soft Matter, vol. 10, pp. 4479–4486, 2014.
  • (78) C. Grimaldi, “A complete graph effective medium approximation for lattice and continuum percolation,” EPL, vol. 96, no. 3, p. 36004, 2011.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description