Toward an AI Physicist for Unsupervised Learning
Abstract
We investigate opportunities and challenges for improving unsupervised machine learning using four common strategies with a long history in physics: divide-and-conquer, Occam's razor, unification and lifelong learning. Instead of using one model to learn everything, we propose a novel paradigm centered around the learning and manipulation of theories, which parsimoniously predict both aspects of the future (from past observations) and the domain in which these predictions are accurate. Specifically, we propose a novel generalized-mean loss to encourage each theory to specialize in its comparatively advantageous domain, and a differentiable description-length objective to downweight bad data and "snap" learned theories into simple symbolic formulas. Theories are stored in a "theory hub", which continuously unifies learned theories and can propose theories when encountering new environments. We test our implementation, the toy "AI Physicist" learning agent, on a suite of increasingly complex physics environments. From unsupervised observation of trajectories through worlds involving random combinations of gravity, electromagnetism, harmonic motion and elastic bounces, our agent typically learns faster and produces mean-squared prediction errors about a billion times smaller than a standard feedforward neural net of comparable complexity, typically recovering integer and rational theory parameters exactly. Our agent also successfully identifies domains with different laws of motion for a nonlinear chaotic double pendulum in a piecewise constant force field.
I Introduction
I.1 Motivation
The ability to predict, analyze and parsimoniously model observations is not only central to physics, but also a goal of unsupervised machine learning, which is a key frontier in artificial intelligence (AI) research LeCun et al. (2015). Despite impressive recent progress, artificial neural nets are still frequently outmatched by human researchers at such modeling, suffering from two drawbacks:

Different parts of the data are often generated by different mechanisms in different contexts. A big model that tries to fit all the data in one environment may therefore underperform in a new environment where some mechanisms are replaced by new ones, since it is inflexible and inefficient at combinatorial generalization Battaglia et al. (2018).

Big models are generally hard to interpret, and may not reveal succinct and universal knowledge such as Newton's law of gravitation that explains only some aspects of the data. The pursuit of "intelligible intelligence" in place of inscrutable black-box neural nets is important and timely, given the growing interest in AI interpretability from AI users and policymakers, especially for AI components involved in decisions and infrastructure where trust is important Russell et al. (2015); Amodei et al. (2016); Boden et al. (2017); Krakovna and Doshi-Velez (2016).
Table 1

Strategy  Definition
Divide-and-conquer  Learn multiple theories, each of which specializes to fit part of the data very well.
Occam's razor  Avoid overfitting by minimizing description length, which can include replacing fitted constants by simple integers or fractions.
Unification  Try unifying learned theories by introducing parameters.
Lifelong learning  Remember learned solutions and try them on future problems.
To address these challenges, we will borrow from physics the core idea of a theory, which parsimoniously predicts both aspects of the future (from past observations) and the domain in which these predictions are accurate. This suggests an alternative to the standard machine-learning paradigm of fitting a single big model to all the data: instead, learning small theories one by one, and gradually accumulating and organizing them. This paradigm suggests the four specific approaches summarized in Table 1, which we combine into a toy "AI Physicist" learning agent: To find individual theories from complex observations, we use the divide-and-conquer strategy with multiple theories and a novel generalized-mean loss that encourages each theory to specialize in its own domain by giving larger gradients to better-performing theories. To find simple theories that avoid overfitting and generalize well, we use the strategy known as Occam's razor, favoring simple theories that explain a lot, using a computationally efficient approximation of the minimum-description-length (MDL) formalism. To unify similar theories found in different environments, we use the description length for clustering and then learn a "master theory" for each class of theories. To accelerate future learning, we use a lifelong learning strategy where learned theories are stored in a theory hub for future use.
I.2 Goals & relation to prior work
The goal of the AI Physicist learning agent presented in this paper is quite limited, and does not even remotely approach the ambition level of problem solving by human physicists. The latter is likely to be almost as challenging as artificial general intelligence, which most AI researchers guess remains decades away Müller and Bostrom (2016); Grace et al. (2018). Rather, the goal of this paper is to take a very modest but useful step in that direction, combining the four physicsinspired strategies above.
Our approach complements other work on automatic program learning, such as neural program synthesis/induction Graves et al. (2014); Sukhbaatar et al. (2015); Reed and De Freitas (2015); Parisotto et al. (2016); Devlin et al. (2017); Bramley et al. (2018) and symbolic program induction Muggleton (1991); Lavrac and Dzeroski (1994); Liang et al. (2010); Ellis et al. (2015); Dechter et al. (2013), and builds on prior machine-learning work on divide-and-conquer Cormen et al. (2009); Fürnkranz (1999); Ghosh et al. (2017), network simplification Rissanen (1978); Hassibi and Stork (1993); Suzuki et al. (2001); Grünwald et al. (2005); Han et al. (2015a, b) and continuous learning Kirkpatrick et al. (2017); Li and Hoiem (2018); Lopez-Paz and Ranzato (2017); Nguyen et al. (2017). It is often said that babies are born scientists, and there is arguably evidence for use of all four of these strategies during childhood development as well Bramley et al. (2018).
There has been significant recent progress on AI approaches specifically linked to physics, including physical scene understanding Yildirim et al. (2018), latent physical properties Zheng et al. (2018); Battaglia et al. (2016); Chang et al. (2016), learning physics simulators Watters et al. (2017), physical concept discovery Iten et al. (2018), an intuitive physics engine Lake et al. (2017), and the "Sir Isaac" automated adaptive inference agent Daniels and Nemenman (2015). Our AI Physicist is different and complementary in two fundamental ways that loosely correspond to the two motivations on the first page:

All of these papers learn one big model to fit all the data. In contrast, the AI Physicist learns many small models applicable in different domains, using the divideandconquer strategy.

Our primary focus is not on making approximate predictions or discovering latent variables, but on near-exact predictions and complete intelligibility. From the former perspective, it is typically irrelevant if a model parameter changes by a tiny amount, but from a physics perspective, one is quite interested to learn whether gravity weakens like distance to the power −2 exactly, rather than some nearby non-integer exponent.
We share this focus on intelligibility with a long tradition of research on computational scientific discovery (Džeroski et al., 2007), including the Bacon system Langley (1981) and its successors Langley and Zytkow (1989), which induced physical laws from observations and which also used a divideandconquer strategy. Other work has extended this paradigm to support discovery of differential equation models from multivariate time series Dzeroski and Todorovski (1995); Bradley et al. (2001); Langley et al. (2003); Langley and Arvay (2015).
The rest of this paper is organized as follows. In Section II, we introduce the architecture of our AI Physicist learning agent, and the algorithms implementing the four strategies. We present the results of our numerical experiments using a suite of physics environment benchmarks in Section III, and discuss our conclusions in Section IV, delegating supplementary technical details to a series of appendices.
II Methods
Unsupervised learning of regularities in time series can be viewed as a supervised learning problem of predicting the future from the past. This paper focuses on the task of predicting the next state vector in a sequence from the concatenation of the last T state vectors, for some fixed T. However, our AI Physicist formalism applies more generally to learning any function from examples. In the following, we first define theory, then introduce a unified AI Physicist architecture implementing the four aforementioned strategies.
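As a concrete illustration (our own sketch, not the paper's code), a trajectory of observed state vectors can be converted into such supervised (input, target) pairs, with T the number of past steps concatenated:

```python
import numpy as np

# Illustrative sketch (not the paper's code): build supervised pairs
# from a trajectory, where each input concatenates the last T state
# vectors and the target is the next state vector.
def make_pairs(trajectory, T=2):
    X, Y = [], []
    for t in range(T, len(trajectory)):
        X.append(np.concatenate(trajectory[t - T:t]))  # last T states
        Y.append(trajectory[t])                        # next state
    return np.array(X), np.array(Y)

# A 5-step 2-D trajectory yields 3 training pairs for T = 2.
traj = [np.array([float(t), 2.0 * t]) for t in range(5)]
X, Y = make_pairs(traj, T=2)
```

With 2-D states and T = 2, each input is 4-dimensional and each target 2-dimensional, matching the prediction task described above.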
II.1 Definition of Theory
A theory is a 2-tuple (f, c), where f is a prediction function that predicts the next state vector when an input is within the theory's domain, and c is a domain sub-classifier which takes the same input and outputs a logit of whether the input is inside this domain. When multiple theories are present, the sub-classifiers' outputs are concatenated and fed into a softmax function, producing probabilities for which theory is applicable. Both f and c can be implemented by a neural net or symbolic formula, and can be set to learnable during training and fixed during prediction/validation.
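A minimal sketch of this definition (class and function names here are our own illustrative choices, not the paper's API):

```python
import numpy as np

# Minimal sketch of the theory 2-tuple (f, c); names are our own.
class Theory:
    def __init__(self, f, c):
        self.f = f  # prediction function: input -> predicted next state
        self.c = c  # domain sub-classifier: input -> scalar logit

def domain_probabilities(theories, x):
    """Concatenate every theory's logit and apply a softmax,
    giving the probability that x lies in each theory's domain."""
    logits = np.array([t.c(x) for t in theories])
    z = np.exp(logits - logits.max())  # shift by max for stability
    return z / z.sum()

# Toy example: two 1-D theories claiming the domains x < 0 and x > 0.
left = Theory(f=lambda x: -x, c=lambda x: -10.0 * x)
right = Theory(f=lambda x: 2.0 * x, c=lambda x: 10.0 * x)
probs = domain_probabilities([left, right], x=1.0)
```

For x = 1.0 the right-hand theory's logit dominates, so the softmax assigns it essentially all the probability.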
This definition draws inspiration from physics theories viewed as conditional statements, such as: "a ball not touching anything (condition) with vertical velocity v and height h will, a time Δt later, have height h + vΔt − gΔt²/2 (prediction function)". For our AI Physicist, theories constitute its "atoms" of learning, as well as the building blocks for higher-level manipulations.
II.2 AI Physicist Architecture Overview
Figure 1 illustrates the architecture of the AI Physicist learning agent. At the center is a theory hub which stores the learned and organized theories. When encountering a new environment, the agent first inspects the hub and proposes old theories that help account for parts of the data, as well as randomly initialized new theories for the rest of the data. All these theories are trained via our divide-and-conquer strategy, first jointly with our generalized-mean loss and then separately to fine-tune each theory in its domain (section II.3). Successful theories, along with the corresponding data, are added to the theory hub.
The theory hub has two organizing strategies: (1) Applying Occam’s razor, it snaps the learned theories, in the form of neural nets, into simpler symbolic formulas (section II.4). (2) Applying unification, it clusters and unifies the symbolic theories into master theories (section II.5). The symbolic and master theories can be added back into the theory hub, improving theory proposals for new environments. The detailed AI Physicist algorithm is presented in a series of appendices. It has polynomial time complexity, as detailed in Appendix F.
II.3 Divide-and-Conquer
Conventionally, a function f mapping x ↦ y is learned by parameterizing f by some parameter vector θ that is adjusted to minimize a loss (empirical risk)

L ≡ Σ_t ℓ(f(x_t), y_t),   (1)
where ℓ is some nonnegative distance function quantifying how far each prediction ŷ_t ≡ f(x_t) is from the target y_t, typically satisfying ℓ(y, y) = 0. In contrast, a physicist observing an unfamiliar environment typically does not try to predict everything with one model, instead starting with an easier question: is there any part or aspect of the world that can be described? For example, when Galileo famously tried to model the motion of swinging lamps in the Pisa cathedral, he completely ignored everything else, and made no attempt to simultaneously predict the behavior of sound waves, light rays, weather, or subatomic particles. In this spirit, we allow multiple competing theories T_1, …, T_M to specialize in different domains, with a novel generalized-mean loss
L_γ ≡ Σ_t [ (1/M) Σ_{i=1}^{M} ℓ(f_i(x_t), y_t)^γ ]^{1/γ}.   (2)
When γ < 0, the loss will be dominated by whichever prediction function f_i fits each data point best. This dominance is controlled by γ, with L_γ → Σ_t min_i ℓ(f_i(x_t), y_t) in the limit γ → −∞. This means that the best way to minimize L_γ is for each f_i to specialize by further improving its accuracy for the data points where it already outperforms the other theories. Theorem 1 below formalizes this intuition, stating that under mild conditions on the loss function ℓ, the generalized-mean loss gives a larger gradient w.r.t. the error for theories that perform better, so that gradient-descent loss minimization encourages specialization.
Theorem 1
Let ŷ^{(i)} ≡ f_i(x_t) denote the prediction of the target y_t by the function f_i, i = 1, …, M.
Suppose that γ < 0 and that ℓ(ŷ, y) = ℓ(u), u ≡ ŷ − y, for a monotonically increasing function ℓ(|u|)
that vanishes on [0, u₀] for some u₀ ≥ 0, with ℓ differentiable and strictly convex for |u| > u₀.
Then if u₀ < |u_i| < |u_j|, we have

|∂L_γ/∂u_i| > |∂L_γ/∂u_j|,   (3)

where u_i ≡ ŷ^{(i)} − y_t.
Appendix G gives the proof, and also shows that this theorem applies to the mean-squared-error (MSE) loss ℓ(u) = u², the mean-absolute-error loss ℓ(u) = |u|, the Huber loss and our description-length loss from the next section.
We find empirically that the simple choice γ = −1 works quite well, striking a good balance between encouraging specialization for the best theory and giving some gradient to theories that currently perform slightly worse. We term this choice the "harmonic loss", because it corresponds to the harmonic mean of the losses for the different theories. Based on the harmonic loss, we propose an unsupervised differentiable divide-and-conquer (DDAC) algorithm (Alg. 2 in Appendix B) that simultaneously learns prediction functions and corresponding domain classifiers from observations.
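A minimal NumPy sketch of the generalized-mean loss of equation (2), using squared error as the distance ℓ and the harmonic choice γ = −1 (this is our own illustrative code, not the paper's DDAC implementation; the small positive floor keeps the negative power well-defined):

```python
import numpy as np

# Illustrative NumPy version of the generalized-mean loss, Eq. (2),
# with squared error as the distance l; gamma = -1 is the harmonic loss.
def generalized_mean_loss(preds, y, gamma=-1.0, floor=1e-12):
    """preds: (M, T) array of M theories' predictions for T points;
    y: (T,) targets. Returns the scalar L_gamma."""
    losses = (preds - y) ** 2 + floor  # keep losses positive for gamma < 0
    per_point = np.mean(losses ** gamma, axis=0) ** (1.0 / gamma)
    return per_point.sum()

# Two competing theories: with gamma < 0, each data point's contribution
# is dominated by whichever theory already fits that point best.
y = np.array([1.0, 2.0, 3.0])
preds = np.array([[1.01, 2.01, 3.01],   # specialist: fits all points well
                  [5.00, 5.00, 5.00]])  # poor theory
L = generalized_mean_loss(preds, y)
```

Here the total loss stays close to the specialist's small squared errors even though the second theory is far off, so gradient descent mostly pushes the already-better theory to improve further.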
Our DDAC method’s combination of multiple prediction modules into a single prediction is reminiscent of AdaBoost Freund and Schapire (1997). While AdaBoost gradually upweights those modules that best predict all the data, DDAC instead identifies complementary modules that each predict some part of the data best, and encourages these modules to simplify and improve by specializing on these respective parts.
II.4 Occam's Razor
The principle of Occam’s razor, that simpler explanations are better, is quite popular among physicists. This preference for parsimony helped dispense with phlogiston, aether and other superfluous concepts.
Our method therefore incorporates the minimum-description-length (MDL) formalism Rissanen (1978); Grünwald et al. (2005), which provides an elegant mathematical implementation of Occam's razor. It is rooted in Solomonoff's theory of inference Solomonoff (1964) and is linked to Hutter's AIXI approach to artificial general intelligence Hutter (2000). The description length (DL) of a dataset D is defined as the number of bits required to describe it. For example, if regularities are discovered that enable data compression, then the corresponding description length is the number of bits of the shortest program that produces D as its output (including both the code bits and the compressed data bits). In our context of predicting a time series, this means that the description length is the number of bits required to describe the theories used plus the number of bits required to store all prediction errors. Finding the optimal data compression, and hence computing the true MDL, is a famously hard problem that involves searching an exponentially large space, but any discovery reducing the description length is a step in the right direction, and provably avoids the overfitting problem that plagues many alternative machine-learning strategies Rissanen (1978); Grünwald et al. (2005).
The end-goal of the AI Physicist is to discover theories minimizing the total description length, given by

DL = Σ_i DL(T_i) + Σ_t DL(u_t),   (4)

where u_t ≡ ŷ_t − y_t is the prediction error at time step t. By discovering simple theories that can each account for parts of the data very well, the AI Physicist strives to make both terms small.
Physics has enjoyed great success in its pursuit of simpler theories using rather vague definitions of simplicity. In this spirit, we choose to compute the description length DL not exactly, but using an approximate heuristic that is numerically efficient and significantly simpler than more precise versions such as Rissanen (1983), paying special attention to rational numbers since they appear in many physics theories. We compute the DL of both theories and prediction errors as the sum of the DL of all numbers that specify them, using the following conventions for the DL of integers, rational numbers and real numbers. Our MDL implementation differs from popular machine-learning approaches whose goal is efficiency and generalizability Hinton and van Camp (1993); Han et al. (2015a); Blier and Ollivier (2018) rather than intelligibility.
The number of binary digits required to specify a natural number n is approximately log₂ n, so we define DL(n) ≡ log₂(1 + n) for natural numbers. For an integer m, we define

DL(m) ≡ log₂(1 + |m|).   (5)
For a rational number m/n, the description length is the sum of that for its integer numerator m and (natural number) denominator n, as illustrated in Figure 2:

DL(m/n) ≡ log₂[(1 + |m|) n].   (6)
For a real number r and a numerical precision floor ε, we define

DL(r) ≡ L(r/ε),   (7)

where the function

L(x) ≡ ½ log₂(1 + x²)   (8)

is plotted in Figure 2. Since L(x) ≈ log₂|x| for |x| ≫ 1, DL(r) is approximately the description length of the integer closest to r/ε. Since L(x) ∝ x² for |x| ≪ 1, DL(r) simplifies to a quadratic (mean-squared-error) loss function below the numerical precision, which will prove useful below.^{1}

^{1} Natural alternative definitions of L(x) exist; unless otherwise specified, we use the definition of equation (8) in our experiments.
Note that as long as all prediction errors satisfy |u_t| ≫ ε for some dataset, minimizing the total description length instead of the MSE corresponds to minimizing the geometric mean instead of the arithmetic mean of the squared errors, which encourages focusing more on improving already well-fit points: Σ_t DL(u_t) drops by 1 bit whenever one prediction error is halved, which can typically be achieved by fine-tuning the fit for many valid data points that are already well predicted, while increasing the DL of bad or extraneous points at most marginally.
For numerical efficiency, our AI Physicist minimizes the description length of equation (4) in two steps: 1) All model parameters are set to trainable real numbers, and the DDAC algorithm is applied to minimize the harmonic loss, with the distance ℓ computed using equation (7) and the annealing procedure for the precision floor ε described in Appendix B. 2) Some model parameters are replaced by rational numbers as described below, followed by re-optimization of the other parameters. The idea behind the second step is that if a physics experiment or neural net training produces a parameter lying extremely close to a simple rational number, it is natural to interpret this as a hint, and to check whether that rational value gives an equally acceptable fit to the data, reducing the total DL. We implement step 2 using continued fraction expansion as described in Appendix C and illustrated in Figure 3.
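Step 2's rational snapping can be sketched with Python's fractions module (a simplified stand-in for the continued-fraction procedure of Appendix C; the candidate denominator limits and tolerance are our own choices):

```python
from fractions import Fraction

# Simplified stand-in for the continued-fraction snapping of Appendix C:
# try successively larger denominator limits (limit_denominator walks
# the continued-fraction convergents internally) and accept the first
# rational that reproduces the trained parameter to within tol.
def snap_to_rational(x, tol=1e-6, max_dens=(1, 10, 100)):
    for max_den in max_dens:
        cand = Fraction(x).limit_denominator(max_den)
        if abs(float(cand) - x) < tol:
            return cand  # e.g. values like 1.4999999 snap to 3/2
    return None          # no simple rational nearby; keep x as a real

snapped = snap_to_rational(1.4999999)
```

A value such as 1.4999999 snaps to 3/2, while a genuinely irrational parameter like √2 admits no small-denominator rational within the tolerance and is kept as a real number.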
II.5 Unification
Physicists aspire not only to find simple theories that explain aspects of the world accurately, but also to discover underlying similarities between theories and unify them. For example, when James Clerk Maxwell corrected and unified four key formulas describing electricity and magnetism into his eponymous equations (dF = 0, d⋆F = ⋆J in differential-form notation), he revealed the nature of light and enabled the era of wireless communication.
Here we make a humble attempt to automate part of this process. The goal of unification is to output a master theory (f_p, c_p), such that varying the parameter vector p generates a continuum of theories, including previously discovered ones. For example, Newton's law of gravitation can be viewed as a master theory unifying the gravitational force formulas around different planets by introducing a parameter corresponding to planet mass. Einstein's special relativity can be viewed as a master theory unifying the approximate formulas for low-speed and near-light-speed motion.
We perform unification by first computing the description length of the prediction function (in symbolic form) for each theory, and clustering the theories by these description lengths. Unification is then achieved by discovering similarities and variations between the symbolic formulas in each cluster, retaining the shared patterns, and introducing components of the parameter vector p in place of the constants that vary, as detailed in Appendix D.
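The first step, grouping symbolic theories whose description lengths are close, can be illustrated with a toy sketch (the fixed gap threshold is our own simplification of the clustering used in Appendix D):

```python
# Toy sketch of grouping theories by description length before
# unification; the fixed gap threshold is our own simplification.
def cluster_by_dl(dls, gap=3.0):
    """Sort theory indices by DL and start a new cluster whenever
    consecutive DLs differ by more than `gap` bits."""
    order = sorted(range(len(dls)), key=lambda i: dls[i])
    clusters, current = [], [order[0]]
    for i in order[1:]:
        if dls[i] - dls[current[-1]] <= gap:
            current.append(i)
        else:
            clusters.append(current)
            current = [i]
    clusters.append(current)
    return clusters

# Four theories: the first three have similar DLs and form one
# cluster, a candidate for unification into a master theory.
groups = cluster_by_dl([49.1, 51.0, 52.8, 120.4])
```

The first three theories land in one cluster whose symbolic formulas would then be compared term by term, while the outlier stays alone.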
II.6 Lifelong Learning
Isaac Newton once said “If I have seen further it is by standing on the shoulders of giants”, emphasizing the utility of building on past discoveries. At a more basic level, our past experiences enable us humans to model new environments much faster than if we had to reacquire all our knowledge from scratch. We therefore embed a lifelong learning strategy into the architecture of the AI Physicist. As shown in Fig. 1 and Alg. 1, the theory hub stores successfully learned theories, organizes them with our Occam’s razor and unification algorithms (reminiscent of what humans do while dreaming and reflecting), and when encountering new environments, uses its accumulated knowledge to propose new theories that can explain parts of the data. This both ensures that past experiences are not forgotten and enables faster learning in novel environments. The detailed algorithms for proposing and adding theories are in Appendix E.
III Results of Numerical Experiments
III.1 Physics Environments
We test our algorithms on two suites of benchmarks, each with increasing complexity. In all cases, the goal is to predict the two-dimensional motion as accurately as possible. One suite involves the chaotic and highly nonlinear motion of a charged double pendulum in two adjacent electric fields. The other suite involves balls affected by gravity, electromagnetic fields, springs and bounce-boundaries, as exemplified in Figure 4. Within each spatial region, the force corresponds to a potential energy function V(r) = A|r − r₀|^ν for some constants A, r₀ and ν, where ν = 0 (no force), ν = 1 (uniform electric or gravitational field), ν = 2 (spring obeying Hooke's law) or ν → ∞ (ideal elastic bounce), and optionally involves also a uniform magnetic field. The environments are summarized in Table 4.
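These force laws can be sketched as follows (our own illustrative code, with our own constants and names; the ν → ∞ bounce case would be handled separately as a hard reflection rather than a finite force):

```python
import numpy as np

# Illustrative sketch of potentials of the form V(r) = A|r - r0|**nu,
# giving F = -grad V in 2-D; constants and names are our own choices.
# nu -> infinity (elastic bounce) is modeled as a hard reflection
# instead of a finite force, so it is not covered here.
def force(r, r0, A, nu):
    d = np.asarray(r, dtype=float) - np.asarray(r0, dtype=float)
    dist = np.linalg.norm(d)
    if nu == 0 or dist == 0.0:
        return np.zeros(2)                 # constant V: no force
    return -A * nu * dist ** (nu - 2) * d  # -grad of A * dist**nu

# nu = 1: uniform-magnitude field pulling toward r0 (gravity-like);
# nu = 2: Hooke's-law spring with F = -2A (r - r0).
f_spring = force([1.0, 0.0], [0.0, 0.0], A=0.5, nu=2)
f_uniform = force([3.0, 4.0], [0.0, 0.0], A=2.0, nu=1)
```

The ν = 1 case produces a force of constant magnitude A·ν everywhere (direction toward r₀), while ν = 2 gives the familiar linear spring force.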
III.2 Numerical Results
In the mystery world example of Figure 4, after running the DDAC algorithm (Alg. 2) with the sequence of coordinates as the only input, we see that the AI Physicist has learned both to predict the future position of the ball from the previous two positions and to classify, without external supervision, the observed inputs into four big physics domains. The predictions are seen to be more accurate deep inside the domains (tiny dots) than near boundaries (larger dots), where transitions and bounces create small domains with laws of motion that are harder to infer because of complexity and limited data. Because these small domains can be automatically inferred and eliminated once the large ones are known, as described in Appendix H, all accuracy benchmarks quoted below refer to points in the large domains only.
After DDAC, the AI Physicist performs Occam's-razor-with-MDL (Alg. 3) on the learned theories. As an example, it discovers that the motion deep inside the lower-left quadrant obeys a difference equation parameterized by a learned 3-layer neural net, which after the first collapseLayer transformation simplifies to
(9)  
with learned real-valued coefficients. The snapping stage thereafter simplifies this to
(10) 
which has lower description length in both model bits and data bits, and gets transformed to the symbolic expressions
(11) 
where for brevity we have written the prediction in terms of the 2D position vector. During unification (Alg. D), the AI Physicist discovers multiple clusters of theories based on the DL of each theory, where one cluster has DL ranging between 48.86 and 55.63, which it unifies into a master theory with
(12) 
effectively discovering a “gravity” master theory out of the different types of environments it encounters. If so desired, the difference equations (III.2) can be automatically generalized to the more familiarlooking differential equations
where , and both the Harmonic Oscillator Equation and Lorentz Force Law of electromagnetism can be analogously autoinferred from other master theories learned.
Many mystery domains in our test suite involve laws of motion whose parameters include both rational and irrational numbers. To count a domain as "solved" below, we use the very stringent requirement that any rational numbers (including integers) must be discovered exactly, while irrational numbers must be recovered to within a strict accuracy threshold.
We apply our AI Physicist to 40 mystery worlds in sequence (Appendix I). After this training, we apply it to a suite of 40 additional worlds to test how it learns with different numbers of examples. The results are shown in Tables 3 and 4, and Table 2 summarizes them using the median over worlds. For comparison, we also show results for two simpler agents with similar parameter count: a "baseline" agent consisting of a three-layer feedforward MSE-minimizing leaky-ReLU network, and a "newborn" AI Physicist that has not seen any past examples and therefore cannot benefit from the lifelong learning strategy.
We see that the Newborn agent outperforms the baseline on all the tabulated measures, and that the AI Physicist does still better. Using all data, the Newborn agent and the AI Physicist are able to predict with mean-squared prediction errors below 10⁻¹³, more than nine orders of magnitude below the baseline. Moreover, the Newborn and AI Physicist agents are able to simultaneously learn the domain classifiers with essentially perfect accuracy, without external supervision. Both agents solve at least 90% of the 40 mystery worlds according to our stringent criteria.
The main advantage of the AI Physicist over the Newborn agent is its learning speed: it attains given accuracy levels faster, especially during the early stage of learning. Remarkably, on the subsequent 40 worlds, the AI Physicist reaches 10⁻² mean-squared error within 35 epochs using as little as 1% of the data, performing almost as well as with 50% of the data and much better than the Newborn agent. This illustrates that the lifelong learning strategy enables the AI Physicist to learn much faster in novel environments with less data, much as an experienced scientist can solve new problems far faster than a beginner by building on prior knowledge about similar problems.
Table 2: Benchmark results (median over worlds).

Benchmark  Baseline  Newborn  AI Physicist
log₁₀ mean-squared error  −3.89  −13.95  −13.88
Classification accuracy  67.56%  100.00%  100.00%
Fraction of worlds solved  0.00%  90.00%  92.50%
Description length for f  11,338.7  198.9  198.9
Epochs until MSE < 10⁻²  95  83  15
Epochs until MSE < 10⁻⁴  6925  330  45
Epochs until MSE < 10⁻⁶    5403  3895
Epochs until MSE < 10⁻⁸    6590  5100
log₁₀ MSE
  using 100% of data  −3.78  −13.89  −13.89
  using 50% of data  −3.84  −13.76  −13.81
  using 10% of data  −3.16  −7.38  −10.54
  using 5% of data  −3.06  −6.06  −6.20
  using 1% of data  −2.46  −3.69  −3.95
Epochs until MSE < 10⁻²
  using 100% of data  95  80  15
  using 50% of data  190  152.5  30
  using 10% of data  195  162.5  30
  using 5% of data  205  165  30
  using 1% of data  397.5  235  35
Our double-pendulum mysteries (Appendix I.2) are more challenging for all the agents, because the motion is more nonlinear and indeed chaotic. Although none of our double-pendulum mysteries get exactly solved according to the very stringent criterion above, Figure 5 illustrates that the Newborn agent does a good job: it discovers the two domains and classifies points into them with an accuracy of 96.5%. Overall, the Newborn agent has a median best classification accuracy of 91.0%, compared with the baseline's 76.9%. The MSE prediction error is comparable to the baseline performance in the median, since both architectures have similarly large capacity. We analyze this challenge and opportunities for improvement below.
IV Conclusions
We have presented a toy “AI Physicist” unsupervised learning agent centered around the learning and manipulation of theories, which in polynomial time learns to parsimoniously predict both aspects of the future (from past observations) and the domain in which these predictions are accurate.
IV.1 Key findings
Testing it on a suite of mystery worlds involving random combinations of gravity, electromagnetism, harmonic motion and elastic bounces, we found that its divide-and-conquer and Occam's razor strategies effectively identified domains with different laws of motion and reduced the mean-squared prediction error billionfold, typically recovering integer and rational theory parameters exactly. These two strategies both encouraged prediction functions to specialize: the former on the domains they handled best, and the latter on the data points within their domain that they handled best. Adding the lifelong learning strategy greatly accelerated learning in novel environments.
IV.2 What has been learned?
Returning to the broader context of unsupervised learning from Section I raises two important questions: what is the difficulty of the problems that our AI physicist solved, and what is the generality of our paradigm?
In terms of difficulty, our solved physics problems are clearly on the easier part of the spectrum: had we faced the supervised learning problem where the different domains were pre-labeled, the domain learning would have been a straightforward classification task, and the forecasting task could have been easily solved by a standard feedforward neural network. Because the real world is generally unlabeled, we instead tackled the more difficult problem where the boundaries of multiple domains had to be learned concurrently with the dynamical evolution rules, in a fully unsupervised fashion. The dramatic performance improvement over a traditional neural network seen in Table 3 reflects the power of the divide-and-conquer and Occam's razor strategies, and their robustness is indicated by the fact that unsupervised domain discovery worked well even for the two-field nonlinear double-pendulum system, whose dynamics are notoriously chaotic and whose domain boundary is a curved rhomboid.
In terms of generality, our core contribution lies in the AI physicist paradigm we propose (combining divide-and-conquer, Occam's razor, unification and lifelong learning), not in the specific implementation details. Here we draw inspiration from the history of the Turing machine: Turing's initial implementation of a universal computer was very inefficient for all but toy problems, but his framework laid out the essential architectural components that subsequent researchers developed into today's powerful computers. What has been learned is that our AI physicist paradigm outperforms traditional deep learning on a test suite of problems even though it is a fully general paradigm that was not designed specifically for these problems. For example, it is defined to work for an arbitrary number of input spatial dimensions, spatial domains and past time steps used, for boundaries of arbitrary shapes, and for evolution laws of arbitrary complexity.
From the above-mentioned successes and failures of our paradigm, we have also learned about promising opportunities for improving the implementation, which we now discuss. First of all, the more modest success in the double-pendulum experiments illustrates the value of learned theories being simple: if they are highly complex, they are less likely to unify or generalize to future environments, and a correspondingly complex baseline model has less incentive to specialize, because it has enough expressive power to approximate the motion in all domains at once. It will therefore be valuable to improve techniques for simplifying complex learned neural nets. The specific implementation details of the Occam's razor toolkit would then change, but the principle and numerical objective would remain the same: reducing the total description length of equation (4). There are many promising opportunities for this using Markov-chain-Monte-Carlo-based and genetic techniques Real et al. (2017), reinforcement learning Zoph and Le (2016); Baker et al. (2016) and symbolic regression Schmidt and Lipson (2009); Udrescu and Tegmark (2019). Also, it will be valuable and straightforward to generalize our implementation to simplify not only the prediction functions but also the classifiers, for example to find sharp domain boundaries composed of hyperplanes or other simple surfaces.
Analogously, there are many ways in which the unification and lifelong learning toolkits can be improved while staying within our AI physicist paradigm. For example, unification can undoubtedly be improved by using more sophisticated clustering techniques for grouping similar learned theories. Lifelong learning can probably be made more efficient with better methods for determining which previous theories to try when faced with new data, for example by training a separate neural network to perform this prediction task.
iv.3 Outlook
In summary, these and other improvements to the algorithms that implement our AI Physicist paradigm could enable future unsupervised learning agents to learn simpler and more accurate models faster from fewer examples, and also to discover accurate symbolic expressions for more complicated physical systems. More broadly, AI has been used with great success to tackle problems in diverse areas of physics, ranging from quantum state reconstruction Carrasquilla et al. (2019) to phase transitions Carrasquilla and Melko (2017); Wang (2016); Van Nieuwenburg et al. (2017), planetary dynamics Lam and Kipping (2018) and particle physics Baldi et al. (2014). We hope that building on the ideas of this paper may one day enable AI to help us discover entirely novel physical theories from data.
Acknowledgements: This work was supported by The Casey and Family Foundation, the Ethics and Governance of AI Fund, the Foundational Questions Institute and the Rothberg Family Fund for Cognitive Science. We thank Isaac Chuang, John Peurifoy and Marin Soljačić for helpful discussions and suggestions, and the Center for Brains, Minds, and Machines (CBMM) for hospitality.
Appendix A AI Physicist Algorithm
The detailed AI Physicist algorithm is presented in Algorithm 1, with links to each of the individual subalgorithms. Like most numerical methods, the algorithm contains a number of hyperparameters that can be tuned to optimize performance; Table 5 lists them and their settings for our numerical experiments.
Appendix B The Differentiable DivideandConquer (DDAC) Algorithm
Here we elaborate on our differentiable divide-and-conquer (DDAC) algorithm with generalized-mean loss (Eq. (2)). This loss works with a broad range of distance functions satisfying Theorem 1. Since the goal of our AI Physicist is to minimize the overall description length (DL) from equation (4), we choose the DL loss function of equation (7) together with the exponent γ = −1 (harmonic loss), which works quite well in practice.
Algorithm 2 describes our differentiable divide-and-conquer implementation, which consists of two stages. In the first stage (steps 2–6), it applies the IterativeTrain subroutine with harmonic loss to train the theories a few times, with the precision floor gradually lowered according to the following annealing schedule. We set the initial precision floor to be quite large, so that the DL loss initially approximates an MSE loss function. After each successive iteration, we reset the precision floor to the median prediction error.
The DL loss function from equation (7) is theoretically desirable but tricky to train with, both because it is nonconvex and because it is quite flat and uninformative far from its minimum. Our annealing schedule helps overcome both problems: initially, when the precision floor is large, the loss approximates the MSE loss, which is convex and guides the training to a good approximate minimum, which further training accurately pinpoints as the precision floor is reduced.
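The interplay between the DL-style loss and the precision-floor annealing can be sketched in a few lines. This is a minimal illustration, not the paper’s exact code: the loss form log2(1 + (u/eps)²) is an assumed stand-in for equation (7), and the function names are ours.

```python
import math

def dl_loss(u, eps):
    """DL-style loss of a prediction error u with precision floor eps.

    Assumed representative form log2(1 + (u/eps)**2): behaves like a
    rescaled MSE, (u/eps)**2 / ln 2, when u << eps, but grows only
    logarithmically when u >> eps, down-weighting bad data points.
    """
    return math.log2(1.0 + (u / eps) ** 2)

def median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

def anneal_precision_floor(prediction_errors, eps_init, rounds):
    """Annealing-schedule sketch: start with a large precision floor so
    the loss is nearly MSE, then reset it to the median prediction error
    after each iteration (the actual training step is elided here)."""
    eps = eps_init
    schedule = [eps]
    for _ in range(rounds):
        # (in the real algorithm, IterativeTrain runs here and the
        # prediction errors shrink before the floor is reset)
        eps = median(prediction_errors)
        schedule.append(eps)
    return schedule
```

For errors much smaller than the floor, `dl_loss` is indistinguishable from a rescaled MSE, which is what makes the early, large-floor rounds convex and easy to train.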
The subroutine IterativeTrain forms the core of the algorithm. In the first stage (steps 2–6), it uses the harmonic mean of the DL loss of the multiple prediction functions (i.e., equation (2) with exponent γ = −1 and the DL loss as distance function) to train these functions simultaneously, encouraging each to specialize in the domains where it predicts best (as proven by Theorem 1), and it simultaneously trains the domain classifier, using each example’s best-performing prediction function as target, with categorical cross-entropy loss. After several rounds of IterativeTrain with successively lower precision floors, each prediction function typically becomes good at predicting part of the dataset, and the domain classifier becomes good at predicting, for each example, which prediction function will predict best.
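The specialization pressure of the harmonic loss can be checked numerically with a tiny sketch (function names are ours; the generalized mean follows equation (2)). With exponent γ = −1, the gradient with respect to the theory that already predicts best dominates, as Theorem 1 predicts:

```python
def generalized_mean_loss(losses, gamma=-1.0):
    """Generalized mean of the per-theory losses for one example.
    gamma = -1 gives the harmonic mean used by DDAC."""
    n = len(losses)
    return (sum(l ** gamma for l in losses) / n) ** (1.0 / gamma)

def grad(losses, i, h=1e-6):
    """Finite-difference gradient of the harmonic loss w.r.t. theory i's
    per-example loss: the best-performing theory gets the largest pull."""
    bumped = list(losses)
    bumped[i] += h
    return (generalized_mean_loss(bumped) - generalized_mean_loss(losses)) / h
```

With per-theory losses `[0.1, 1.0]`, the gradient with respect to the first (better) theory is about 100 times larger than with respect to the second, which is exactly the winner-take-most dynamic that drives domain specialization.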
AddTheories() inspects each theory describing at least a large fraction of the examples, to see whether a non-negligible proportion of the examples inside its domain have MSE larger than a certain limit and thus warrant splitting off into a separate domain.
If so, it uses those examples to initialize a new theory and performs tentative training together with the other theories using IterativeTrain without steps s8 and s9 (it is also possible to allow steps s8 and s9 in this recursive call of IterativeTrain, which would enable recursive adding of theories for not-well-explained data and might yield a more powerful DDAC algorithm). If the resulting loss is smaller than before adding the new theory, the new theory is accepted and retained; otherwise it is rejected and training reverts to the checkpoint before it was added. DeleteTheories() deletes theories whose domain or best-predicted examples cover only a negligible fraction of the examples.
In the second stage (steps 7–10), IterativeTrain is applied again, but the loss for each example is computed using only the theory that the domain classifier predicts (the one with the largest logit). In this way, we iteratively fine-tune each prediction function within its own domain, and fine-tune each domain toward the best-performing theory at each point. The reason we assign examples to domains using our domain classifier rather than by prediction accuracy is that the trained domains are likely to be simpler and more contiguous, thus generalizing better to unseen examples than, e.g., a nearest-neighbor assignment.
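This hard assignment can be sketched as follows (our own illustrative code, using lists of lists rather than the tensors used in practice, and plain MSE in place of the DL loss):

```python
def second_stage_loss(preds, logits, targets):
    """Stage-2 DDAC loss sketch: each example is scored only by the
    theory the domain classifier selects (largest logit).

    preds[i][t], logits[i][t]: theory i's prediction / classifier logit
    for example t; targets[t]: the target for example t.
    """
    num_theories, num_examples = len(preds), len(targets)
    total = 0.0
    for t in range(num_examples):
        # theory chosen by the classifier for this example
        best = max(range(num_theories), key=lambda i: logits[i][t])
        total += (preds[best][t] - targets[t]) ** 2
    return total / num_examples
```

Only the selected theory receives gradient for a given example, so each prediction function is fine-tuned purely on the domain the classifier assigns to it.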
We now specify the default hyperparameters used for Algorithm 1 in our experiments (unless otherwise specified); their numerical settings, including the initial total number of theories, the number of theories proposed from the theory hub, the initial precision floor, the number of gradient iterations and the initial learning rates for the prediction functions and the domain classifier, are listed in Table 5. We use the Adam Kingma and Ba (2014) optimizer with default parameters for optimizing both the prediction functions and the domain classifier. We randomly split each dataset into training and evaluation subsets with a 4:1 ratio; the latter is used only for evaluating performance. The batch size is set to min(2000, dataset size). We also use a learning-rate scheduler that monitors the validation loss every 10 epochs, divides the learning rate by 10 if the validation loss has failed to decrease after 40 monitoring points, and stops training early if there is no decrease after 200 epochs, or if the MSE loss for all the theories in their respective domains drops below a preset threshold.
To the main harmonic loss, we add two regularization terms. One is a regularization loss whose strength increases quadratically from 0 to its final value during the first 5000 epochs and remains constant thereafter. The second regularization term is a very small MSE loss, which encourages the prediction functions to remain not too far from the target even outside their own domains.
Appendix C Occam’s Razor with MDL Algorithm
Pushing on after the DDAC algorithm with harmonic loss, which minimizes the first term of Eq. (4), the AI Physicist then strives to minimize the second term, the description length DL(T) of the theories themselves, which can be decomposed as DL(T) = DL(f) + DL(c), where f denotes the prediction functions and c the domain classifiers. We focus on minimizing DL(f), since the prediction functions can often be reused in different environments, while the domains may differ. As mentioned, we define DL(f) simply as the sum of the description lengths of the numbers parameterizing f:
DL(f) = Σ_j DL(p_j),   (13)

where the p_j are the parameters of f.
This means that the description length can be significantly reduced if an irrational parameter is replaced by a simpler rational number.
If a physics experiment or neural net training produces a parameter extremely close to an integer, it would be natural to interpret this as a hint, and to check whether the nearest integer gives an equally acceptable fit to the data. We formalize this by replacing any real-valued parameter in our theory by its nearest integer if this reduces the total description length in equation (4), as detailed below. We start this search for integer candidates with the parameter that is closest to an integer, refitting the remaining parameters after each successful “integer snap”.
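The try-closest-first ordering can be sketched in a couple of lines (helper name is ours; the description-length acceptance test and refitting are elided):

```python
def integer_snap_order(params):
    """Order parameter indices by how close each parameter is to an
    integer, so that the most promising integer-snap candidates are
    attempted first."""
    return sorted(range(len(params)),
                  key=lambda i: abs(params[i] - round(params[i])))
```

Each candidate in this order would then be snapped tentatively, the remaining parameters refit, and the snap kept only if the total description length of equation (4) decreases.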
What if we instead observe a parameter that lies close to a simple fraction rather than an integer? Whereas generic real numbers have a closest integer, they lack a closest rational number. Moreover, as illustrated in Figure 2, we care not only about closeness (to avoid increasing the second term in equation (4)), but also about simplicity (to reduce the first term). To rapidly find the best “rational snap” candidates (dots in Figure 2 that lie both near and far down), we perform a continued-fraction expansion of the parameter and use each truncation of the resulting series as a rational candidate. We repeat this for all parameters in the theory, again accepting only those snaps that reduce the total description length. We again wish to try the most promising snap candidates first; to rapidly identify promising candidates without having to recompute the second term in equation (4), we evaluate all truncations of all parameters as in Figure 3, comparing the description length of the rational approximation with the description length of the approximation error. The most promising candidate minimizes their sum, i.e., lies furthest down and to the left of the diagonal in the figure. The figure illustrates how, for an example parameter vector, the first snap to be attempted replaces the third parameter by a nearby simple fraction.
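Generating the rational-snap candidates by continued-fraction truncation can be sketched as follows (our own minimal code; the description-length ranking of the candidates from Figure 3 is omitted):

```python
import math
from fractions import Fraction

def rational_candidates(x, max_terms=8, tol=1e-12):
    """Continued-fraction expansion of x; each truncation of the series
    is a rational-snap candidate (a dot in Figure 2)."""
    terms, candidates = [], []
    v = x
    for _ in range(max_terms):
        a = math.floor(v)
        terms.append(a)
        # rebuild the truncated continued fraction as an exact Fraction
        frac = Fraction(terms[-1])
        for q in reversed(terms[:-1]):
            frac = q + 1 / frac
        candidates.append(frac)
        rem = v - a
        if rem < tol:          # expansion terminated: x is (nearly) rational
            break
        v = 1.0 / rem
    return candidates
```

For a parameter like 0.4999, the second truncation already yields 1/2, the kind of simple fraction whose short description length makes the snap worthwhile.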
We propose Algorithm 3 to implement the above minimization of DL(f) without increasing the total description length (Eq. 4). For each theory, we first extract the examples inside its domain, then perform a series of tentative transformations (simplifications) of the prediction function using the MinimizeDL subroutine. This subroutine takes the prediction function, the transformation, and the data as inputs and repeatedly applies the transformation. After each such transformation, it fine-tunes the fit to the data using gradient descent. To determine whether to accept a transformation, Algorithm 3 presents the simplest 0-step-patience implementation: if the description length for the theory decreases, apply the transformation again if possible; otherwise exit the loop. In general, to allow for temporary increases of DL during the transformations, a nonzero patience can be adopted: at each step, save the best-performing model as the pivot model, and exit the loop if the DL fails to decrease during a fixed number of consecutive transformations inside MinimizeDL. In our implementation, we use a 4-step patience.
We now detail the five transformations used in Algorithm 3. The collapseLayer transformation finds all successive layers of a neural net where the lower layer has linear activation, and combines them into one. The toSymbolic transformation transforms the prediction function from the form of a neural net into a symbolic expression (in our implementation, from a PyTorch net to a SymPy symbolic lambda expression). These two transformations are one-time transformations (for example, once the prediction function has been transformed to a symbolic expression, toSymbolic cannot be applied to it again). The localSnap transformation successively sets the incoming weights of the first layer to 0, thus favoring inputs that are closer to the current time step. The integerSnap transformation finds the (non-snapped) parameter that is closest to an integer, and snaps it to that integer. The rationalSnap transformation finds the (non-snapped) parameter that has the lowest bit sum when replaced by a rational number, as described in Section II.4, and snaps it to that rational number. The latter three transformations can be applied multiple times, until there are no more parameters to snap, or until the transformation followed by fine-tuning fails to reduce the description length.
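The MinimizeDL control flow with nonzero patience can be sketched as follows (our own simplification: the gradient-descent fine-tuning step is reduced to a comment, and a transformation signals exhaustion by returning None):

```python
def minimize_dl(model, transform, dl, patience=4):
    """Repeatedly apply `transform` to `model`, keeping the
    best-performing (lowest description length) model as the pivot, and
    stop after `patience` consecutive steps without a DL decrease or
    when the transformation can no longer be applied."""
    pivot, pivot_dl = model, dl(model)
    current, fails = model, 0
    while True:
        current = transform(current)
        if current is None:            # transformation exhausted
            break
        # (in the real algorithm, fine-tune `current` by gradient
        #  descent on the data inside the theory's domain here)
        current_dl = dl(current)
        if current_dl < pivot_dl:
            pivot, pivot_dl, fails = current, current_dl, 0
        else:
            fails += 1
            if fails >= patience:      # DL stuck: give up and revert
                break
    return pivot
```

Setting `patience=0` recovers the simplest variant described above, while a 4-step patience lets the search ride out temporary DL increases.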
In the bigger picture, Algorithm 3 is an implementation of minimizing the description length of the prediction function without increasing the total description length, when the former is given by Eq. (13). There may be other ways of encoding the prediction function, with a different description-length formula, in which case the transformations for decreasing it may differ. But the structure of Algorithm 3 remains the same, with the goal of minimizing the prediction function's description length without increasing the total, with respect to whatever DL formula it is based on.
In the still bigger picture, Algorithm 3 is a computationally efficient approximate implementation of the MDL formalism, involving the following two approximations:

The description lengths for various types of numbers are approximate, for convenience. For example, the length of the shortest self-terminating bit string encoding an arbitrary natural number n grows slightly faster than our approximation log₂ n, because self-termination requires storing not only the binary digits of the integer, but also the length of said bit string, recursively, requiring approximately log₂ n + log₂ log₂ n + log₂ log₂ log₂ n + ⋯, where only the positive terms are included Rissanen (1983). Slight additional overhead is required to upgrade the encodings to actual programs in some suitable language, including encoding whether bits represent integers, rational numbers, floating-point numbers, etc.
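The recursive sum above (Rissanen's universal code length for integers, keeping only the positive terms) can be computed directly; a small sketch with our naming:

```python
import math

def logstar2(n):
    """Self-delimiting code length for a natural number n:
    log2(n) + log2(log2(n)) + ..., summing only the positive terms."""
    total, x = 0.0, float(n)
    while True:
        x = math.log2(x)
        if x <= 0:          # remaining terms are non-positive: stop
            break
        total += x
    return total
```

For n = 16 this gives 4 + 2 + 1 = 7 bits, slightly more than the plain log₂ 16 = 4 bits of our approximation, illustrating the overhead of self-termination.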

If the above-mentioned formulas were made exact, they would still be mere upper bounds on the true minimum description length. For example, our algorithm assigns a gigabyte-scale description length to a number specified to extremely high precision even when that number can be computed by a rather short program, and there is no simple algorithm for determining which numbers can be accurately approximated by algebraic numbers. Computing the true minimum description length is a famously intractable problem.
Appendix D Unification Algorithm
The unification process takes as input the symbolic prediction functions and outputs master theories such that, by varying a parameter vector p, we can generate a continuum of prediction functions within a certain class. Each symbolic expression consists of three building blocks: operators, input variables, and coefficients that can be either rational or irrational numbers. The unification algorithm first calculates the DL of each prediction function, then clusters the functions using, e.g., K-means clustering. Within each cluster, it first canonicalizes each prediction function into a 2-tuple: a tree-form expression, in which each internal node is an operator and each leaf is an input variable or a coefficient, together with the structure of that tree, obtained by replacing every coefficient with a placeholder symbol. When multiple orderings are equivalent (e.g., the two argument orders of a commutative operator), it always uses a predefined partial ordering. The algorithm then obtains the largest set of prediction functions within the cluster that share the same structure (steps 7–8). This eliminates expressions within the cluster that might interfere with the subsequent unification. Step 9 is the core part: it traverses the trees with synchronized steps using, e.g., depth-first or breadth-first search, which is possible since all selected trees share the same structure. During the traversal, whenever a coefficient is encountered and the coefficients across the trees at this position are not all the same, they are replaced by a symbol that has not been used before. Essentially, we turn every coefficient that varies across the cluster into a parameter, while coefficients that do not vary stay as they are. In this way, we obtain a master prediction function. Finally, at step 13, the algorithm merges the master prediction functions that have the exact same form, and returns them. The domain classifier is ignored during the unification process, since in different environments each prediction function can have vastly different spatial domains; it is the prediction function (which characterizes the equation of motion) that is important for generalization.
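Step 9, replacing coefficients that vary across same-structure expressions with fresh parameter symbols, can be sketched on toy expression trees (our own minimal representation: nested tuples with an operator name first and numbers or variable-name strings at the leaves):

```python
def unify(trees):
    """Traverse same-structure expression trees in lockstep, keeping
    leaves that agree everywhere and replacing any coefficient that
    varies across the trees with a fresh parameter symbol p1, p2, ..."""
    counter = [0]

    def walk(nodes):
        first = nodes[0]
        if isinstance(first, tuple):                 # operator node: recurse
            children = zip(*(n[1:] for n in nodes))  # align children across trees
            return (first[0],) + tuple(walk(list(c)) for c in children)
        if all(n == first for n in nodes):           # identical leaf: keep it
            return first
        counter[0] += 1                              # varying coefficient
        return "p%d" % counter[0]

    return walk(trees)
```

For example, unifying a·x + 1 across two environments with a = 2 and a = 3 keeps the shared structure and the shared constant 1 while promoting the varying slope to a parameter p1, yielding a master prediction function.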
Appendix E Adding and Proposing Theories
Here we detail the algorithms for adding theories to the hub and proposing them for use in new environments. Alg. 5 provides the simplest version of the theory-proposing algorithm. Given a new dataset, the theory hub inspects all stored theories and, for each one, counts the number of data points on which it outperforms all the other theories. The theories with the largest counts are then proposed.
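This counting scheme can be sketched as follows (our own minimal code, assuming the losses of each hub theory on each new data point have been precomputed):

```python
def propose_theories(hub_losses, M):
    """hub_losses[i][t]: loss of hub theory i on data point t of the new
    dataset.  Count, for each theory, the points on which it beats all
    the others, and propose the M theories with the largest counts."""
    num_theories, num_points = len(hub_losses), len(hub_losses[0])
    counts = [0] * num_theories
    for t in range(num_points):
        winner = min(range(num_theories), key=lambda i: hub_losses[i][t])
        counts[winner] += 1
    # sort theory indices by win count, largest first (stable sort)
    return sorted(range(num_theories), key=lambda i: -counts[i])[:M]
```

The proposed theories then initialize some of the prediction functions for DDAC training in the new environment, while the rest are initialized randomly.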
For theory adding after training with DDAC (Alg. 2), each theory calculates its description length inside its domain. If this description length is smaller than a threshold, the theory and its corresponding examples are added to the theory hub. The reason the data are added as well is that they provide a record of how the theory was trained, and are also needed by the Occam’s razor algorithm.
Appendix F Time complexity
In this appendix, we list crude estimates of the time complexity of our AI physicist algorithm, i.e., of how its runtime scales with key parameters.
DDAC, the differentiable divide-and-conquer algorithm (Alg. 2), is run once for each of the different mystery worlds, with a total runtime scaling roughly as

t_DDAC = O(N_p N_d N_T²),

where N_p is the average number of neural-network parameters in a theory, N_d is the average number of data points (time steps) per mystery and N_T is the number of discovered domains. The power of two for N_T appears because the time to evaluate the loss function scales as N_T, and we need to perform of order N_T training cycles to add the right number of theories. The linear scaling with N_p arises because the forward and backward propagation of a neural net involves successive matrix-vector multiplications, each scaling as O(n_k n_{k+1}), where n_k is the width of layer k; accumulating over all layers gives O(N_p). We make no attempt to model how the number of training epochs needed to attain the desired accuracy depends on these parameters.
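The claim that a forward pass costs roughly one operation per weight can be sanity-checked with a one-liner (our naming): summing the per-layer matrix-vector costs reproduces the total weight count of a dense network.

```python
def matvec_flops(layer_widths):
    """Multiply-accumulate count for one forward pass through a dense
    net: each layer costs n_k * n_{k+1}, and the sum over layers equals
    (up to bias terms) the total number of weights."""
    return sum(a * b for a, b in zip(layer_widths, layer_widths[1:]))
```

For a 4-8-2 network this gives 4·8 + 8·2 = 48 multiply-accumulates, matching its 48 weights, which is why the per-pass cost is linear in the parameter count.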
Our lifelong-learning algorithm is also run once per mystery, with a time cost dominated by that for proposing new theories (Alg. 5). This cost scales as the product of the number of theories in the theory hub and the number of data points, since each stored theory must be evaluated on each data point.
In contrast, our Occam’s Razor algorithm (Alg. 3) and unification algorithm (Alg. 4) are run once per learned theory, not once per mystery. For Occam’s Razor, the total runtime is dominated by that for snapping to rational numbers, which grows with the number of parameters to be snapped and with the number of data points used for the fine-tuning after each tentative snap.
For the unification, the total runtime grows only with the number and size of the learned symbolic expressions, and can be neglected relative to the cost of Occam’s razor.
We note that all these algorithms have merely polynomial time complexity. The DDAC algorithm dominates the time cost; our mystery worlds were typically solved in about 1 hour each on a single CPU. If vast amounts of data are available, it may suffice to analyze a random subset of much smaller size.
Appendix G Proof of Theorem 1 and Corollary
Here we give the proof for Theorem 1, restated here for convenience.
Theorem 1 Let ŷ^(i) denote the prediction of the target y by the function f_i, i = 1, …, n. Suppose that γ < 0 and ℓ(ŷ, y) = ℓ(u), where u ≡ |ŷ − y|, for a monotonically increasing function ℓ(u) that vanishes on [0, u₀] for some u₀ ≥ 0, with ℓ(u)^γ differentiable and strictly convex for u > u₀. Then if u₀ < u_i < u_j, we have

|∂L_γ/∂ŷ^(i)| > |∂L_γ/∂ŷ^(j)|,   (14)

where u_k ≡ |ŷ^(k) − y|.

Proof. Since ℓ(ŷ^(k), y) = ℓ(u_k) and γ < 0, the generalized mean loss as defined in Eq. 3 can be rewritten as

L_γ = ( (1/n) Σ_k ℓ(u_k)^γ )^{1/γ},   (15)

which implies that

∂L_γ/∂u_j = (1/n) ( (1/n) Σ_k ℓ(u_k)^γ )^{(1−γ)/γ} ℓ(u_j)^{γ−1} ℓ′(u_j).

Since only the last factor g(u_j) ≡ ℓ(u_j)^{γ−1} ℓ′(u_j) depends on j, proving equation (14) is equivalent to proving that

u₀ < u_i < u_j  ⟹  g(u_i) > g(u_j).   (16)

Let us henceforth consider only the case u_i > u₀, since the conditions otherwise imply ℓ(u_i) = 0 and hence L_γ = 0. Since γ < 0, ℓ(u)^γ is strictly convex for u > u₀ and (ℓ(u)^γ)′ = γ ℓ(u)^{γ−1} ℓ′(u) = γ g(u), we have that γ g(u) is strictly increasing, so that