Perceiving Physical Equation by Observing Visual Scenarios
Inferring universal laws of the environment is an important ability of human intelligence as well as a symbol of general AI. In this paper, we take a step toward this goal by introducing a new challenging problem of inferring invariant physical equations from visual scenarios, for instance, teaching a machine to automatically derive the gravitational acceleration formula by watching a free-falling object. To tackle this challenge, we present a novel pipeline comprising an Observer Engine and a Physicist Engine, which respectively imitate the actions of an observer and a physicist in the real world. Generally, the Observer Engine watches the visual scenarios and then extracts the physical properties of objects. The Physicist Engine analyses these data and then summarizes the inherent laws of object dynamics. Specifically, the learned laws are expressed as mathematical equations, so they are more interpretable than the results given by common probabilistic models. Experiments on synthetic videos show that our pipeline is able to discover physical equations in various physical worlds with different visual appearances.
Siyu Huang* (Zhejiang University), Zhi-Qi Cheng* (Southwest Jiaotong University), Xi Li (Zhejiang University), Xiao Wu (Southwest Jiaotong University), Zhongfei (Mark) Zhang (Zhejiang University), Alexander Hauptmann (Carnegie Mellon University)
* Equal contributions. This work was done when Siyu Huang and Zhi-Qi Cheng were visiting Carnegie Mellon University.
NIPS 2018 Workshop on Modeling the Physical World. Montréal, Canada.
1 Introduction
Inference is one of the most basic and significant aspects of human intelligence tenenbaum2011grow () as well as AI lake2017building (). As a high-level aspect of inference, the induction of universal laws from observations of our world is both the core basis and the goal of scientific research. For example, Sir Isaac Newton saw an apple falling and was then inspired to discover the law of gravitation. However, for a computing machine, the induction of laws based on visual observations is still a very challenging and open problem, and it has rarely been explored in the existing literature.
In this paper, we introduce a new problem: teaching a machine to automatically derive mathematical expressions of object dynamics from videos of a physical world. In contrast to the most recent approaches battaglia2016interaction (); watters2017visual (); wu2017learning (), which learn object mechanical behaviors with the black box of deep neural networks, we aim at explicitly presenting the symbolic expressions of latent physical laws, leading to a more interpretable model and more visualizable results. A pioneering work schmidt2009distilling () learns to derive mathematical equations from the data of physical experiments. In this work, by contrast, we propose to learn mathematical expressions directly from complicated videos.
Toward this goal, we propose a novel pipeline comprised of an Observer Engine and a Physicist Engine. The Observer Engine acts like an observer that watches the videos of a physical scenario and extracts the physical properties of objects in that scenario. Then the Physicist Engine imitates a physicist that summarizes the observed data and finally derives the mathematical equations.
In the experiments, we evaluate our pipeline on synthetic videos of multiple physical scenarios, showing that it is able to learn precise mathematical equations on these physical worlds with diverse visual appearances. We also explore several variants of models for the Observer Engine and the Physicist Engine respectively, so as to quantitatively establish baselines for relevant research in the future.
Our contributions are three-fold. First, we introduce a new problem of learning mathematical equations of object dynamics from videos, taking a step toward the automatic induction of universal laws for general AI. Second, we propose a novel pipeline to tackle this challenging problem. Third, empirical studies demonstrate the effectiveness of our approach on several synthetic physical scenarios.
2 Related Work
Physical reasoning has drawn much attention from AI researchers in recent years. Previous work on physical reasoning explored learning the common-sense knowledge of physical scenarios battaglia2013simulation (); goyal2017something () and developing simulation techniques for inferring the future states of physical systems fragkiadaki2015learning (); mottaghi2016happens (); wang2018learning (). A typical example is to predict whether a stack of blocks will fall gupta2010blocks (); lerer2016learning (); li2016fall (). In addition, the simulation and prediction of macroscopic physical phenomena, including weather events racah2017extremeweather () and fluids jeong2015data (); de2017deep (), have also been studied. The “NeuroAnimator” grzeszczuk1998neuroanimator () was the pioneering work to quantitatively simulate the physical dynamics of articulated bodies with neural networks. Today, the learning of object dynamics mottaghi2016newtonian (); byravan2017se3 (); chang2016compositional (); ehrhardt2017learning () has become a research hotspot.
More recently, researchers have incorporated powerful deep neural networks into physical reasoning systems to enable a deeper understanding of the physical properties underlying visual scenarios. Interaction Network (IN) battaglia2016interaction () and Visual Interaction Network (VIN) watters2017visual () were successively proposed for modeling the dynamic relationships between physical objects in videos. Wu et al. wu2017learning () learned, end to end, a hybrid of graphics engines and physics engines to predict the long-term visual observations of a physical world.
In these approaches, the dynamics of objects and their interactions are generally modeled by the non-linear transformations of neural networks, which are black-box models. The explicit symbolic expressions of object kinetic properties are neither revealed nor interpreted. In this work, we take a first step toward an interpretable physical reasoning model, in which we attempt to summarize the object kinetic properties as precise physical equations by observing videos of a physical world.
In this work, we concentrate on inferring a physical equation from visual scenarios, which is rarely explored in the existing literature. From the perspective of equation regression, there have been many efforts to learn symbolic relationships from non-structured data teodorescu2008high (); sutskever2009using (); uy2011semantically (). From another perspective, several approaches bhat2002computing (); brubaker2009estimating () learned to fit the parameters of Newtonian mechanics equations to physical systems depicted by videos. For instance, Wu et al. wu2015galileo (); wu2016physics () proposed deep learning models to infer the physical properties (such as mass, volume, and coefficient of friction) of objects from real-world videos. However, the symbolic expressions of the physical equations themselves are still not learned in these approaches.
A popular method for learning mathematical expressions is called “symbolic regression” augusto2000symbolic (); giustolisi2006symbolic (), which is adopted in this work for the generation of physical equations. Symbolic regression is a machine learning technique that identifies a mathematical expression minimizing a customized error metric, based on genetic programming koza1994genetic () and evolutionary algorithms mckay1995using (). Unlike the traditional linear and non-linear regression methods that fit parameters to an equation of a given form, symbolic regression searches for both the parameters and the form of equations simultaneously schmidt2009distilling (). More details of our approach are discussed in the following sections.
Our model learns to infer the inherent mathematical equation from video frames of a physical system. It consists of an Observer Engine and a Physicist Engine.
The Observer Engine acts like an observer that watches the videos of a physical world and at the same time records the physical-property variables. As illustrated in Fig. 2(a), it captures the physical properties of the kinetic objects and the environment in videos. In this work, we use the Faster-RCNN ren2015faster () model to detect an object and localize its position according to the coordinates of its bounding box. To get a more precise object position, we employ a two-stage approach that refines the position on coarse-to-fine spatial scales. Specifically, a Faster-RCNN detector is applied to an image to get a coarse window of an object, then another Faster-RCNN detector is applied to the window to get a fine bounding box. The two-stage approach ensures precise object localization and speeds up the detection procedure. The velocity of an object is computed from the change of its position between two video frames divided by the time interval between them. The observation data (position, velocity, and time interval) are fed to the Physicist Engine, serving as the independent variables.
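A minimal sketch of how per-frame bounding boxes could be turned into observation records is given below. It assumes a forward difference between consecutive frames and box centers as positions; the function names `box_center` and `observe` and the record layout are illustrative choices, not part of the pipeline described above.

```python
from typing import Dict, List, Tuple

# (x_min, y_min, x_max, y_max) in pixels; this layout is an assumption.
Box = Tuple[float, float, float, float]


def box_center(box: Box) -> Tuple[float, float]:
    """Return the center point of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def observe(boxes: List[Box], dt: float) -> List[Dict[str, float]]:
    """Convert one box per frame into position/velocity/time-interval records.

    Velocity is a forward difference between consecutive frames
    (an assumed discretization).
    """
    centers = [box_center(b) for b in boxes]
    records = []
    for (x0, y0), (x1, y1) in zip(centers, centers[1:]):
        records.append({
            "x": x0,
            "y": y0,
            "vx": (x1 - x0) / dt,
            "vy": (y1 - y0) / dt,
            "dt": dt,
        })
    return records


if __name__ == "__main__":
    demo_boxes = [(0, 0, 10, 10), (2, 1, 12, 11), (4, 4, 14, 14)]
    for record in observe(demo_boxes, dt=0.5):
        print(record)
```

A sequence of N frames yields N-1 records, since the last frame has no successor to difference against.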
The Physicist Engine acts like a physicist that infers the equation based on the observations given by the Observer Engine. It takes as input a set of objects’ physical properties (output by the Observer Engine applied to a series of videos) and outputs the equation relating displacement to the independent variables. In this work, we adopt symbolic regression with genetic programming (GP) augusto2000symbolic (); giustolisi2006symbolic () for the inference of the mathematical equation, implemented with the gplearn toolkit (http://gplearn.readthedocs.io/en/stable/index.html).
As illustrated in Fig. 2(b), the formula is represented as a syntax tree. The variables, denoted as the round nodes, are leaves of the tree. The mathematical operations, denoted as the square nodes, connect the independent variables. Our goal is to find the best formula, consisting of arbitrary independent variables and mathematical operations, that minimizes the mean absolute error (MAE) with respect to the given target. At the very beginning, a population of formulas is randomly initialized. In an evolutionary manner, GP evolves the fittest ones of every generation until convergence. More details of GP are discussed in Section 4.2.
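The syntax-tree representation and the MAE fitness can be sketched as follows. The nested-tuple encoding, the names `evaluate` and `mae`, and the protected division (returning 1.0 for a near-zero denominator) are assumptions made for this illustration; it is not the gplearn internals.

```python
import operator
from typing import Dict, List, Tuple, Union

# A tree is a variable name (leaf), a numeric constant (leaf),
# or a tuple (operation, left_subtree, right_subtree).
Tree = Union[str, float, Tuple]

OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    # Protected division: avoid dividing by (almost) zero.
    "div": lambda a, b: a / b if abs(b) > 1e-9 else 1.0,
}


def evaluate(tree: Tree, env: Dict[str, float]) -> float:
    """Recursively evaluate a syntax tree for one observation."""
    if isinstance(tree, str):
        return env[tree]
    if isinstance(tree, (int, float)):
        return float(tree)
    op, left, right = tree
    return OPS[op](evaluate(left, env), evaluate(right, env))


def mae(tree: Tree, samples: List[Dict[str, float]], targets: List[float]) -> float:
    """Mean absolute error of a candidate formula over all observations."""
    errors = [abs(evaluate(tree, s) - t) for s, t in zip(samples, targets)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    samples = [{"v": 1.0, "dt": 0.1}, {"v": 2.0, "dt": 0.1}, {"v": 3.0, "dt": 0.2}]
    targets = [s["v"] * s["dt"] for s in samples]  # toy target: v * dt
    for candidate in [("mul", "v", "dt"), ("add", "v", "dt")]:
        print(candidate, mae(candidate, samples, targets))
```

In this toy setting the candidate `("mul", "v", "dt")` reproduces the target exactly, so it would be ranked as the fittest formula of the two.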
4.1 Physical Scenarios
We conduct experiments on five types of physical scenarios. In each scenario, there is an object obeying a basic dynamic equation that relates its displacement vector to its velocity vector and its accelerated velocity vector. The accelerated velocity corresponds to the specific object dynamics of each physical scenario, including
Drift There is no external force applied on the object. The object drifts with its initial velocity, so the accelerated velocity is zero.
Free-falling The object goes into free fall under gravity. The accelerated velocity is given by the gravitational acceleration constant, directed vertically downward.
Parabola The object moves along a parabola under gravity. The accelerated velocity is the same as in Free-falling, while the object has a random initial horizontal velocity.
Slope The object slides downhill on a smooth slope. The accelerated velocity is the component of the gravitational acceleration along the slope, which depends on the slope gradient.
Spring The object is connected to a horizontal wall with a visible spring obeying Hooke’s law. The accelerated velocity is determined by the restoring force of the spring. The Hooke’s constant, the attachment point y-coordinate, and the equilibrium distance are constants in an experiment.
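As a sketch in assumed notation, the five cases can be written with standard Newtonian mechanics. All symbols here are illustrative choices: the y-axis is taken to point downward, $g$ is the gravitational acceleration constant, $\theta$ is the slope angle with the object sliding toward positive $x$, $k$ is the Hooke's constant, $m$ the object mass, $y_0$ the attachment point y-coordinate, and $l_0$ the equilibrium distance. The Spring case further assumes a vertical spring with no gravity term.

```latex
\begin{align*}
\text{Drift:} \quad & \vec{a} = (0,\ 0) \\
\text{Free-falling, Parabola:} \quad & \vec{a} = (0,\ g) \\
\text{Slope:} \quad & \vec{a} = (g \sin\theta \cos\theta,\ g \sin^{2}\theta) \\
\text{Spring:} \quad & \vec{a} = \left(0,\ -\tfrac{k}{m}\,(y - y_0 - l_0)\right)
\end{align*}
```

Under these conventions the Slope acceleration has magnitude $g \sin\theta$ along the incline, and the Spring acceleration vanishes when the object sits exactly at the equilibrium distance from the attachment point.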
For each physical scenario, we generate 300 videos for training the Observer Engine and 100 videos for testing our pipeline, where each video has 100 frames. To simulate real-world scenarios, following watters2017visual () we use a random CIFAR-10 krizhevsky2009learning () natural image as the background of each synthetic video. There is no overlap of background images between the training set and the testing set. The image size of a video frame is set to 38K×38K because a larger image size enables a smaller relative error when estimating the object position. The sizes of objects in the videos are the same. The constant independent variables are fixed in the experiments. We do not fix the slope gradient, as it is an observable variable. The object initial position, the object initial velocity, the object mass, and the slope gradient are random in every video.
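To illustrate how the ground-truth object positions of one synthetic video could be generated, the sketch below samples a constant-acceleration trajectory in closed form. It covers only the constant-acceleration scenarios (Drift, Free-falling, Parabola, Slope); the function name `simulate` and every numeric value in the demo (time interval, acceleration, initial state) are hypothetical, since the constants used in the experiments are not restated here.

```python
from typing import List, Tuple


def simulate(
    x0: float, y0: float,
    vx0: float, vy0: float,
    ax: float, ay: float,
    dt: float,
    n_frames: int = 100,
) -> List[Tuple[float, float]]:
    """Return the object position at frame times t = 0, dt, 2*dt, ...

    Uses standard constant-acceleration kinematics:
    p(t) = p0 + v0 * t + 0.5 * a * t^2.
    """
    frames = []
    for i in range(n_frames):
        t = i * dt
        x = x0 + vx0 * t + 0.5 * ax * t * t
        y = y0 + vy0 * t + 0.5 * ay * t * t
        frames.append((x, y))
    return frames


if __name__ == "__main__":
    # Parabola-like motion with hypothetical values (y-axis pointing down).
    trajectory = simulate(x0=0.0, y0=0.0, vx0=3.0, vy0=0.0,
                          ax=0.0, ay=9.8, dt=0.01)
    print(len(trajectory), trajectory[:3])
```

Rendering the object onto a background image at each returned position would then give the video frames.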
4.2 Learning Details
In the Observer Engine, we use a two-stage Faster-RCNN object detector whose backbone network is the pretrained ResNet-101 he2016deep () model. In each stage, 2,000 and 1,000 images are randomly sampled as the training set and the validation set, respectively. In the first stage, we train an object detector to detect objects on the original video frames (38K×38K pixels). The object detector is trained by the SGD optimizer with a learning rate of 0.005, a batch size of 4, and a learning rate decay of 8. After 4 epochs of training, the detection model converges and obtains a 95.5% mAP on the validation set. In the second stage, we crop a 4K×4K part from the original image using the bounding box output by the first stage. Then we train another object detector to refine the object position. The detector is trained by the SGD optimizer with a learning rate of 0.001, a batch size of 1, and a learning rate decay of 4. The detection model converges after 6 epochs and obtains a 97.7% mAP on the validation set. The object position is estimated as the center point of the bounding box output by the second detector. Table 1 shows that the Euclidean distance error of our estimated position is less than 1 pixel on average.
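The coordinate bookkeeping of the coarse-to-fine refinement can be sketched as below. This is an illustration under assumptions: the crop is taken as a square window centered on the coarse box and clamped to the frame, and `fine_detector` is a hypothetical callable standing in for the second detector, which receives the crop window and returns a box in crop-local coordinates.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def refine_position(
    frame_size: Tuple[float, float],
    coarse_box: Box,
    fine_detector: Callable[[Box], Box],
    crop_size: float,
) -> Tuple[float, float]:
    """Estimate the object center in full-frame coordinates.

    1. Center a crop window on the coarse box (clamped inside the frame).
    2. Run the fine detector inside that window.
    3. Map the fine box center back to frame coordinates.
    """
    width, height = frame_size
    cx = (coarse_box[0] + coarse_box[2]) / 2.0
    cy = (coarse_box[1] + coarse_box[3]) / 2.0
    # Top-left corner of the crop window, kept inside the frame.
    ox = min(max(cx - crop_size / 2.0, 0), width - crop_size)
    oy = min(max(cy - crop_size / 2.0, 0), height - crop_size)
    fx0, fy0, fx1, fy1 = fine_detector((ox, oy, ox + crop_size, oy + crop_size))
    return (ox + (fx0 + fx1) / 2.0, oy + (fy0 + fy1) / 2.0)


if __name__ == "__main__":
    # Stub detector: always reports the same box in crop-local coordinates.
    stub = lambda window: (40.0, 45.0, 60.0, 65.0)
    print(refine_position((1000, 1000), (100, 100, 300, 300), stub, crop_size=200))
```

Because the fine detector works on a much smaller window, a localization error of a few pixels in crop coordinates stays a few pixels in frame coordinates, which is the motivation for the second stage.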
In the Physicist Engine, we use genetic programming to evolve the syntax tree which represents a mathematical equation. The independent variables include the position components, the velocity components, the object mass, and the time interval. The mass of the object is treated as known by the Physicist Engine, as it could be easily estimated in the real world. In the Slope scenario, the independent variables also include the terms describing the slope gradient. We do not use as independent variable, because the difference between and is numerically trivial for regression under small . The arithmetic operations, including addition (+), subtraction (−), multiplication (×), and division (÷), are used for every scenario. Genetic operations including crossover, subtree mutation, hoist mutation, and point mutation are employed in evolution.
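The four genetic operations can be illustrated on a nested-tuple syntax tree as follows. This is a simplified sketch rather than the gplearn implementation: the tree encoding, the helper names, the variable list, and the uniform random choices are all assumptions, and the operation probabilities used in the experiments are not modeled.

```python
import random
from typing import List, Tuple, Union

Tree = Union[str, Tuple]   # leaf variable name, or (op, left, right)
Path = Tuple[int, ...]     # child indices (1 or 2) leading from the root
OPS = ["add", "sub", "mul", "div"]
VARS = ["x", "y", "vx", "vy", "m", "dt"]  # illustrative variable names


def paths(tree: Tree, prefix: Path = ()) -> List[Path]:
    """List the path to every node of the tree, root first."""
    found = [prefix]
    if isinstance(tree, tuple):
        found += paths(tree[1], prefix + (1,)) + paths(tree[2], prefix + (2,))
    return found


def get(tree: Tree, path: Path) -> Tree:
    """Return the subtree located at the given path."""
    for index in path:
        tree = tree[index]
    return tree


def put(tree: Tree, path: Path, new: Tree) -> Tree:
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    node = list(tree)
    node[path[0]] = put(tree[path[0]], path[1:], new)
    return tuple(node)


def random_tree(rng: random.Random, depth: int = 2) -> Tree:
    """Grow a random formula of at most the given depth."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(VARS)
    return (rng.choice(OPS), random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def crossover(a: Tree, b: Tree, rng: random.Random) -> Tree:
    """Replace a random subtree of a with a random subtree of b."""
    return put(a, rng.choice(paths(a)), get(b, rng.choice(paths(b))))


def subtree_mutation(a: Tree, rng: random.Random) -> Tree:
    """Crossover with a freshly generated random tree."""
    return crossover(a, random_tree(rng), rng)


def hoist_mutation(a: Tree, rng: random.Random) -> Tree:
    """Replace a random subtree by one of its own subtrees (never grows the tree)."""
    target = rng.choice(paths(a))
    sub = get(a, target)
    return put(a, target, get(sub, rng.choice(paths(sub))))


def point_mutation(a: Tree, rng: random.Random) -> Tree:
    """Swap one node for another of the same kind (operation or variable)."""
    target = rng.choice(paths(a))
    node = get(a, target)
    if isinstance(node, tuple):
        return put(a, target, (rng.choice(OPS), node[1], node[2]))
    return put(a, target, rng.choice(VARS))


if __name__ == "__main__":
    rng = random.Random(0)
    parent = ("add", ("mul", "vx", "dt"), "x")
    print(crossover(parent, ("sub", "vy", "m"), rng))
    print(hoist_mutation(parent, rng))
    print(point_mutation(parent, rng))
```

Hoist mutation can only shrink a formula, which counteracts the tendency of crossover and subtree mutation to grow ever larger trees.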
5.1 Perceiving Mathematical Equation
We show in Fig. 3 that our pipeline is able to perceive mathematical equations on a variety of physical scenarios with diverse visual appearances. Five different physical scenarios are shown in the first column, where the objects move under the corresponding dynamic equations. Our Observer Engine detects the bounding boxes (green) of objects, providing precise object positions to the Physicist Engine. In the right part of Fig. 3, we show the mathematical equations and syntax trees learned by the Physicist Engine, where the x-component and the y-component of displacement are shown in the second column and the third column, respectively.
Fig. 3 demonstrates that the Physicist Engine can learn the dynamic equations of all the physical scenarios, even though the dynamic equations of some scenarios (Slope and Spring) are complex. Not only are the symbolic relationships correctly learned, but the physical constants in the mathematical equations are also accurately estimated by our method (e.g., the ground-truth gravitational acceleration constant in the Free-falling, Parabola, and Slope scenarios, and the ground-truth Hooke’s constant in the Spring scenario).
Please note that every equation in Fig. 3 is generated based on the same independent variables (except particular arguments of the environment) and the same arithmetic operations across all the scenarios. This suggests that our method is scalable to many other physical systems which are not included in the experiments of this work. In addition, our Observer Engine is effective on complex real-world background images with diverse visual appearances, indicating that our method could probably be applied to real-world videos in a future study.
We have explored several variants of models for the Observer Engine and the Physicist Engine respectively, so as to quantitatively establish baselines for our newly proposed problem.
For the Observer Engine, we study two baseline methods for a quantitative comparison with the Two-stage Detector used in this work:
Single Detector: The basic Faster R-CNN ren2015faster () model, where the Region Proposal Network (RPN) is used for estimating the bounding box of the object. The bounding box is used for computing the position of the object.
Detection + Segmentation: First, a Single Detector is used to get an object bounding box. Then, a fully convolutional network (FCN) segments the object within the pre-detected bounding box to localize a more accurate object position.
Two-stage Detector: The method adopted by this work. Two Single Detectors are stacked to detect the object in a coarse-to-fine strategy.
|Method||MED (pixels)|
|Detection + Segmentation||5.26|
Table 1 shows the baseline performances of the Observer Engine under the metric of mean Euclidean distance (MED) between the estimated position and the ground-truth position. Comparing Detection + Segmentation with Single Detector, the segmentation operation reduces the error of the single Faster R-CNN model by about 8×. Comparing Two-stage Detector with Single Detector, the second detector reduces the error of the first detector by about 72×. The Two-stage Detector used in this work achieves a mean error of 0.55 pixel under the 38K×38K coordinate system, suggesting that it can be extended to various visual scenarios and real-world applications.
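For reference, the MED metric can be computed as in the following minimal sketch; the function name and the list-of-points input format are illustrative assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def mean_euclidean_distance(estimated: List[Point], ground_truth: List[Point]) -> float:
    """Average Euclidean distance between paired estimated and true positions."""
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(math.dist(e, g) for e, g in zip(estimated, ground_truth))
    return total / len(estimated)


if __name__ == "__main__":
    est = [(10.0, 10.0), (23.0, 14.0)]
    gt = [(10.0, 10.0), (20.0, 10.0)]
    print(mean_euclidean_distance(est, gt))
```

The result is expressed in the same unit as the input coordinates, so pixel coordinates give an error in pixels.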
For the Physicist Engine, we also study a series of common regression methods for a comparison with the symbolic regression algorithm used in this work. The baselines include (1) linear regression, (2) ridge regression, (3) decision tree, and (4) random forest. These models are implemented with the scikit-learn toolbox scikit-learn (). Fig. 4 shows the baseline performances of the Physicist Engine under the metric of mean absolute percentage accuracy (MAPA) between the ground-truth displacement and the estimated displacement.
MAPA denotes the relative estimation accuracy. In Fig. 4, the baseline methods perform well on the Free-falling, Parabola, and Slope scenarios. On the Drift and Spring scenarios, however, the methods differ markedly: decision tree and random forest perform significantly better than linear regression and ridge regression. The main reason is that decision tree and random forest are non-linear models and thus have much better non-linear representation capabilities than linear/ridge regression. Our symbolic regression algorithm performs the best, with an accuracy of almost 1.0 on every scenario. The syntax tree of symbolic regression can directly represent the relationships (such as multiplication and division) between independent variables, so symbolic regression is naturally suited for learning mathematical equations.
|Observer Engine||Linear regression||Ridge regression||Decision tree||Random forest||SR||GT|
|Detection + Segmentation||0.958||0.958||0.832||0.922||0.954||0.954|
Table 2 shows a more comprehensive ablation study of our pipeline, where the baseline methods of the two engines are pairwise combined and evaluated on all the physical scenarios. We use the R² coefficient score as the metric to evaluate the goodness of fit in this study. Interestingly, when working with Single Detector or Detection + Segmentation, the methods of the Physicist Engine sometimes perform better than the ground-truth equation (GT). This is mainly because these methods eliminate some position errors in fitting. We observe that our pipeline (a combination of Two-stage Detector and symbolic regression, SR) gets a 1.000 R² score, as it successfully identifies all of the dynamic equations and accurately estimates the constants, as shown in Fig. 3. Comparing Two-stage Detector with Ground-truth Position and comparing SR with GT, both methods show performances close to the ground truth, indicating that they are compatible with the different methods of the other engine.
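Assuming the coefficient score used here is the usual coefficient of determination, it can be computed as in the sketch below. The function name mirrors the common convention but the code is a self-contained illustration, not the evaluation script of this study.

```python
from typing import Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the targets are constant")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    truth = [0.10, 0.25, 0.40, 0.55]
    print(r2_score(truth, truth))                     # perfect fit
    print(r2_score(truth, [0.12, 0.24, 0.41, 0.50]))  # slightly noisy fit
```

A score of 1 means the predicted displacements match the targets exactly, while a model that always predicts the mean target scores 0.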
We have introduced a new problem of deriving mathematical equations from physical scenarios, taking a step toward the goal of reasoning about universal laws from a complex environment. We have presented a pipeline including an Observer Engine and a Physicist Engine to tackle this problem for the first time. In the experiments, we have shown that our pipeline is able to perceive dynamic equations on various physical scenarios whose visual appearances are quite different. Ablation studies conducted on combinations of baselines further demonstrate the effectiveness of our pipeline. In general, our pipeline is an effective template for reasoning about physical and dynamic systems. By combining deep learning, symbolic learning, and evolutionary algorithms, we show the potential of a hybrid machine learning system for AI reasoning. We hope this work may inspire future study on inference, induction, and conceptual understanding in general AI.
In the future, an important direction is to demonstrate the proposed pipeline in real-world scenarios, which may have more unknown noise than the synthetic data. It will also be important to develop techniques to handle multi-object physical systems battaglia2016interaction (); watters2017visual (); wu2017learning (), in which there are interactions between objects rather than only the dynamics of a single object. Learning the equations of a composite set of dynamic laws is a challenging and meaningful task. In addition, our pipeline could probably be extended to some practical applications, e.g., helping physicists to summarize and analyse experimental data in complex visual scenarios.
-  Joshua B Tenenbaum, Charles Kemp, Thomas L Griffiths, and Noah D Goodman. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279–1285, 2011.
-  Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
-  Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In NIPS, pages 4502–4510, 2016.
-  Nicholas Watters, Daniel Zoran, Theophane Weber, Peter Battaglia, Razvan Pascanu, and Andrea Tacchetti. Visual interaction networks: Learning a physics simulator from video. In NIPS, pages 4542–4550, 2017.
-  Jiajun Wu, Erika Lu, Pushmeet Kohli, Bill Freeman, and Josh Tenenbaum. Learning to see physics via visual de-animation. In NIPS, pages 152–163, 2017.
-  Michael Schmidt and Hod Lipson. Distilling free-form natural laws from experimental data. Science, 324(5923):81–85, 2009.
-  Peter W Battaglia, Jessica B Hamrick, and Joshua B Tenenbaum. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 110(45):18327–18332, 2013.
-  Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The “something something” video database for learning and evaluating visual common sense. In ICCV, 2017.
-  Katerina Fragkiadaki, Pulkit Agrawal, Sergey Levine, and Jitendra Malik. Learning visual predictive models of physics for playing billiards. 2016.
-  Roozbeh Mottaghi, Mohammad Rastegari, Abhinav Gupta, and Ali Farhadi. “what happens if…” learning to predict the effect of forces in images. In ECCV, pages 269–285. Springer, 2016.
-  Zhihua Wang, Stefano Rosa, Bo Yang, Sen Wang, Niki Trigoni, and Andrew Markham. 3d-physnet: Learning the intuitive physics of non-rigid object deformations. In IJCAI, 2018.
-  Abhinav Gupta, Alexei A Efros, and Martial Hebert. Blocks world revisited: Image understanding using qualitative geometry and mechanics. In ECCV, pages 482–496. Springer, 2010.
-  Adam Lerer, Sam Gross, and Rob Fergus. Learning physical intuition of block towers by example. In ICML, pages 430–438, 2016.
-  Wenbin Li, Seyedmajid Azimi, Aleš Leonardis, and Mario Fritz. To fall or not to fall: A visual approach to physical stability prediction. arXiv preprint arXiv:1604.00066, 2016.
-  Evan Racah, Christopher Beckham, Tegan Maharaj, Samira Ebrahimi Kahou, Mr Prabhat, and Chris Pal. Extremeweather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events. In NIPS, pages 3402–3413, 2017.
-  SoHyeon Jeong, Barbara Solenthaler, Marc Pollefeys, Markus Gross, et al. Data-driven fluid simulations using regression forests. ACM Transactions on Graphics, 34(6):199, 2015.
-  Emmanuel de Bezenac, Arthur Pajot, and Patrick Gallinari. Deep learning for physical processes: Incorporating prior scientific knowledge. ICLR, 2018.
-  Radek Grzeszczuk, Demetri Terzopoulos, and Geoffrey Hinton. Neuroanimator: Fast neural network emulation and control of physics-based models. In Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pages 9–20, 1998.
-  Roozbeh Mottaghi, Hessam Bagherinezhad, Mohammad Rastegari, and Ali Farhadi. Newtonian scene understanding: Unfolding the dynamics of objects in static images. In CVPR, pages 3521–3529, 2016.
-  Arunkumar Byravan and Dieter Fox. Se3-nets: Learning rigid body motion using deep neural networks. In ICRA, pages 173–180, 2017.
-  Michael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional object-based approach to learning physical dynamics. ICLR, 2017.
-  Sebastien Ehrhardt, Aron Monszpart, Niloy J Mitra, and Andrea Vedaldi. Learning a physical long-term predictor. arXiv preprint arXiv:1703.00247, 2017.
-  Liliana Teodorescu and Daniel Sherwood. High energy physics event selection with gene expression programming. Computer Physics Communications, 178(6):409–419, 2008.
-  Ilya Sutskever and Geoffrey E Hinton. Using matrices to model symbolic relationship. In NIPS, pages 1593–1600, 2009.
-  Nguyen Quang Uy, Nguyen Xuan Hoai, Michael O’Neill, Robert I McKay, and Edgar Galván-López. Semantically-based crossover in genetic programming: application to real-valued symbolic regression. Genetic Programming and Evolvable Machines, 12(2):91–119, 2011.
-  Kiran S Bhat, Steven M Seitz, Jovan Popović, and Pradeep K Khosla. Computing the physical parameters of rigid-body motion from video. In ECCV, pages 551–565. Springer, 2002.
-  Marcus A Brubaker, Leonid Sigal, and David J Fleet. Estimating contact dynamics. In ICCV, pages 2389–2396, 2009.
-  Jiajun Wu, Ilker Yildirim, Joseph J Lim, Bill Freeman, and Josh Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In NIPS, pages 127–135, 2015.
-  Jiajun Wu, Joseph J Lim, Hongyi Zhang, Joshua B Tenenbaum, and William T Freeman. Physics 101: Learning physical object properties from unlabeled videos. In BMVC, volume 2, page 7, 2016.
-  Douglas Adriano Augusto and Helio JC Barbosa. Symbolic regression via genetic programming. In Proceedings. Sixth Brazilian Symposium on Neural Networks, pages 173–178, 2000.
-  Orazio Giustolisi and Dragan A Savic. A symbolic data-driven technique based on evolutionary polynomial regression. Journal of Hydroinformatics, 8(3):207–222, 2006.
-  John R Koza. Genetic programming as a means for programming computers by natural selection. Statistics and Computing, 4(2):87–112, 1994.
-  Ben McKay, Mark J Willis, and Geoffrey W Barton. Using a tree structured genetic algorithm to perform symbolic regression. In International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications, pages 487–492, 1995.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
-  Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.