main.bib
Software paper for submission to the Journal of Open Research Software
To complete this template, please replace the blue text with your own. The paper has three main sections: (1) Overview; (2) Availability; (3) Reuse potential.
Please submit the completed paper to: editor.jors@ubiquitypress.com
(1) Overview
Title
Glyph: Symbolic Regression Tools
Paper Authors
1. Quade, Markus, markus.quade@ambrosys.de
2. Gout, Julien, julien.gout@ambrosys.de
3. Abel, Markus, markus.abel@ambrosys.de
Paper Author Roles and Affiliations
1. Quade, Markus, lead developer, Ambrosys GmbH, David Gilly Straße 1, Potsdam, Germany
2. Gout, Julien, core contributer, Ambrosys GmbH, David Gilly Straße 1, Potsdam, Germany
3. Abel, Markus, scientific project lead, Ambrosys GmbH, David Gilly Straße 1, Potsdam, Germany
Abstract
We present Glyph  a Python package for genetic programming based symbolic regression. Glyphis designed for usage let by numerical simulations let by real world experiments. For experimentalists, glyphremote provides a separation of tasks: a ZeroMQ interface splits the genetic programming optimization task from the evaluation of an experimental (or numerical) run. Glyphcan be accessed at https://github.com/Ambrosys/glyph. Domain experts are be able to employ symbolic regression in their experiments with ease, even if they are not expert programmers. The reuse potential is kept high by a generic interface design. Glyph is available on PyPI and Github.
Keywords
Symbolic Regression; Genetic Programming; MLC; Python
Introduction
Symbolic regression [schmidt2009distilling] is an optimization method to find an optimal representation of a function. The method is “symbolic”, because building blocks of the functions, i.e. variables, primitive functions, and operators, are represented symbolically on the computer. Genetic programming (GP) [koza1992genetic] can be implemented to find such a function for system identification [Vladislavleva2013, Quade2016] or fluid dynamical control [gout2016, MLC]. Glyphis an effort to separate optimization method and optimization task allowing domainexperts without special programming skills to employ symbolic regression in their experiments. We adopt this separation of concerns implementing a clientserver architecture; a minimal communication protocol eases its use. Throughout this paper “experiment” is meant as a synonym for any symbolic regression task including a labexperiment, a numerical simulation or data fitting.
Previous work on system identification and reverse engineering of conservation laws was reported in [schmidt2009distilling, schmidt2011automated]. Modern algorithms also include multi objective optimization [Quade2016] and advances like age fitness based genetic programming [Schmidt2011] or epigenetic local search [LaCava2016]. There exist various approaches to the representation of multi IO problems, including stack or graphbased representations and pointers [LaCava2016, GalvanLopez2008].
Implementation and architecture
Glyphis intended as a lightweight framework to build an application finding an optimal system representation given measurement data. The main application is intended as system control, consequently a control law is determined and returned. Glyphis built on the idea of loose coupling such that dependencies can be released if wanted.
A typical control application consists of a system and its controller, possibly separated, cf. Fig. 1. Glyphhas three main abstractions to build such an application: i) the assessment, which holds all methods and data structures belonging to the experiment, ii) the GP which is responsible for the system identification, and iii) the application components, which constitute an application.
Building an Application
An application consists of a GP callable, the gp_runner, an assessment callable for input, the assessment_runner, and the application which uses both of these classes and holds all applicationrelevant details. A commandline application is built by
The assessment_runner has one argument, the parallel_factory which implements a map() method, possibly parallel. For an application one needs to implement setup, assign_fitness, and measure: setup is selfexplaining, measure is a key method which takes as input a set of measurement functions and combines them into a tuple of callable measures for multiobjective optimization. The measures are used eventually in assign_fitness where the return values are used to assign a fitness to an individual from GP. The interface is freely extensible. A gp_runner forwards the evolutionary iteration. It takes as arguments, gp_args, an individual class, a gp_algorithm, and an assessment_runner. The individual class contains the representation of a function, the individual; it is currently based on deap’s treebased implementation. The gp_algorithm takes care for the breeding and selection steps, its principles are described in [koza1992genetic].
The application is run in the main function with app.run(). Each of the highlevel functions contains a bunch of nextlevel instructions, and can be built with a minimal assembly of methods.
In the application and gp_runner, the user has freedom to add functionality using the list of callbacks in the arguments, say, to implement other logging or streaming options. This allows for very flexible programming. We constructed the components that way to allow users to specialize for their particular experiments and possibly increase performance or extend the symbolic regression, e.g. by replacing the deap treebased representation of an individual.
Remote Control
One main objective of glyph is its use in a real experiment. In this case, the GP loop is separated from the experimental loop in a clientserver setup using ZeroMQ [Akgul:2013:ZER:2523409], cf. Fig. 2.
Consequently, one should implement the interface to the experiment using the protocol described in Sec. Communication Protocol. Having the implementation of the experiment, the server, one needs to implement the client, i.e. the interface to the gp_runner. In essence this means connecting the correct sockets whith ZeroMQ and ensuring that the gp_runner and the assessment_runner use the corresponding sockets. Then, the main application is assembled as before, now using a RemoteApp for the main application, which in turn uses a gp_runner, which then uses now a RemoteAssessmentRunner. That is it, we can run remotely our GP evaluation from some client and the experiment in place of the experiment.
Communication Protocol
The communication is encoded in json [ecma404]. A message is a json object with two members:
The possible values are listed in Table 1.
Action name  Payload  Expected return Value 

CONFIG  –  config settings 
EXPERIMENT  list of expressions  list of fitness value(s) 
SHUTDOWN  –  – 
The config action is performed prior to the evolutionary loop. Entering the loop, discovered solutions will be batched and a experiment action will be requested. You can configure optional caching for rediscovered solutions. This includes persistent caching between different runs. The shutdown action will let the experiment program know that the gp loop is finished and you can safely stop the hardware.
Configuration settings are sent as a json object in key:value form, where the keys contain the option to be set, there is only one mandatory option: the primitive set. To configure the primitive set, the primitive names are passed as content of the key config, whose values specify the corresponding arities, both fields described again as json object.
The experiment action sends a list of expressions, encoded as string in prefix (also: polish) notation [jorke1989arithmetische]. For each expression sent, the experiment returns a fitness tuple.
Additionally, one can define the type of algorithm, error metric, representation, hyperparameters, etc. A comprehensive up to date list can be found at http://glyph.readthedocs.io/en/latest/usr/glyph_remote/.
Application example: control of the chaotic Lorenz System
In the following, we demonstrate the application and use of Glyphby the determination of an unknown optimal control law for a chaotic system. As an example, we study the control of the potentially chaotic Lorenz system. Chaotic systems are very hard to predict and control in practice due to their sensitivity towards small changes in the initial state which may lead to exponential divergence of trajectories. The Lorenz model [Lorenz1963] consists of a system of three ordinary differential equations:
(1) 
with two nonlinearities, and . Here , , and make up the system state and , , are parameters: is the Prandtl number, is the Rayleigh number, and b is related to the aspect ratio of the air rolls. For a certain choice of parameters and initial conditions chaotic behavior emerges.
Here we present two examples where the target is to learn control of bring a chaotic Lorenz system to a complete stop, that is, (). In the first example, the actuator term is applied to . This allows for a more direct control of the system, since appears in every equation of (1) and, thus, influence all three state components, , , and . In the second example the actuator term is applied to , which leads to a more indirect control, since the flow of information from to is only through .
population size  

max. generations  
MOO algorithm  NSGAII 
tree generation  halfandhalf 
min. height  1 
max. height  4 
selection  selTournament 
tournament size  
breeding  varOr 
recombination  cxOnePoint 
crossover probability  
crossover max. height  
mutation  mutUniform 
mutation probability  
mutation max. height  
constant optimization  leastsq 
dynamic system  GP  

cost functionals  RMSE  
RMSE  
RMSE  
length  
argument set  
constant set  
seed (in )  
seed (in ) 
The system setup is summarized in Table 2 and Table 3. When , , and , the Lorenz system produces chaotic solutions (not all solutions are chaotic). Almost all initial points will tend to an invariant set – the Lorenz attractor – a strange attractor and a fractal. When plotted the chaotic trajectory of the Lorenz system resembles a butterfly (blue graph in Fig. 3). The target of control is, again, formulated as RMSE of the system state with respect to zero (separately for each component)
The control function can make use of ideal measurements of the state components. Constant optimization is performed on a single constant . The respective GP runs for control in and control in are conducted with the corresponding random seeds labeled “in ” and “in ”.
Control in :
For control in the actuator term is added to the left side of the equation for in the uncontrolled system (1):
The Pareto solutions from the GP run are shown in Table 4. The wide spread of the cost indices is a sign of conflicting objectives that are hard to satisfy in conjunction. Interestingly, almost all solutions, , commonly introduce a negative growth rate into . This effectively drives to zero and suppresses the growth terms, and , in the equations for and respectively, in turn, driving and to zero as well. As would be expected, minimal expressions, of length 1 or 2, cannot compete in terms of the RMSE. For example, the simple solution, (fourth row), is almost as good as the lengthier one, (first row), and even better in RMSE.
RMSE  RMSE  RMSE  length  expression  constants 

Table 4 shows the results from the GP run. One solution immediately stands out: , with (second row). It is exactly what one might expect as a control term for the chaotic Lorenz system with control in . This control law effectively reduces the Rayleigh number to a value close to zero (), pushing the Lorenz system past the first pitchfork bifurcation, at , back into the stableorigin regime. If then there is only one equilibrium point, which is at the origin. This point corresponds to no convection. All orbits converge to the origin, which is a global attractor, when .
The phase portrait of the solution from the first and second row of Table 4 are illustrated in Fig. 3. After a short excursion in negative direction (), the green trajectory quickly converges to zero. The red trajectory seems to take a shorter path in phase space, but, it is actually slower to converge to the origin. This is verified by a plot of the trajectories for the separate dimensions , and over time Fig. 4.
Control in :
For control in the actuator term is added to the left side of the equation for in the uncontrolled system (1)
Selected Paretofront individuals from the GP run are displayed in Table 5. As mentioned at the beginning of this section, effective control is hindered by the indirect influence of on the other state variables, hence, it is not surprising that the control laws here are more involved than in the previous case. Also, they generally do not perform well in the control of , which is expressed by the relatively high values in RMSE. This is confirmed by the phase portrait of the solution shown in figure Fig. 5: While going straight to the origin in the plane there are strong oscillations of the trajectory along the axis.
The dynamics caused by the actuation, e.g. for the best control law found, can be explained qualitatively: there is a strong damping in all variables but . This reflects the tendency to suppress oscillations and, at the same time, to add damping in through the term: if grows, the contribution to damping on the right hand side of the Lorenz equations (1) grows and, in turn, damps . This is, however, only possible to some extent, hence, the oscillations observed in figure Fig. 5.
RMSE  RMSE  RMSE  length  expression  constants 

We conclude the demonstration with a short summary: Using Glyphwe can find complex control laws, even for unknown systems. This cannot be easily achieved with other frameworks. The control laws found can be studied analytically in contrast to several other methods which have blackbox character. The usage is straightforward, as we have described above. The above example can be found online as an example.
Other symbolic regression libraries
Due to its popularity, symbolic regression is implemented by most genetic programming libraries. A semicurated list can be found at http://geneticprogramming.com/software/. In contrast to other implementations, Glyphimplements higher concepts, such as symbolic constant optimization, and also offers parallel execution for complex examples (control simulation, system identification). Glyphis well tested, cf. Table 6 and currently applied in two experiments and several numerical problems. For control, there exists a dedicated matlab toolbox (with python interface), openMLC [machinelearningcontrol_2017], which contains much of the material treated in [MLC].
CI/tests  doc  open api  caching  checkpointing  MOGP  SCO  MO  

openMLC  ✓  ✗  ✗  ✓  ✓  ✗  ✗  ✗ 
Glyph  ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ 
Quality control
Continuous Integration tests are conducted for Mac, Linux and Windows using Travis and AppVoyer. The tests consider Python version 3.5 and 3.6 . Unit test coverage is around 85% as reported by codecov. Additionally, tests specifically cover the stochastic parts of the optimization to ensure reproducibility. Along with the software tests are shipped which guarantee the correct execution of the examples. The user can reuse these tests for further development. Locally, tests can be executed via the pytest command.
(2) Availability
Operating system
Glyphis compatible with Mac, Linux and Windows.
Programming language
Python 3.5+
Dependencies
Currently, Glyphis based on DEAP [DEAP_JMLR2012], an evolutionary computation framework adopting a toolboxlike structure for rapid prototyping. Further dependencies are found uptodate at https://github.com/Ambrosys/glyph/blob/master/requirements.txt.
List of contributors
Core contributors (prior to open source): Markus Quade Julien Gout
Open source contributors can be found at https://github.com/Ambrosys/glyph/graphs/contributors.
Software location:
Archive Zenodo
 Name:

Ambrosys/glyph
 Persistent identifier:
 Licence:

LGPL
 Publisher:

Markus Quade
 Version published:

0.3.3
 Date published:

12.09.17
Code repository Github
 Name:

glyph
 Persistent identifier:
 Licence:

LGPL
 Date published:

08.12.16
Language
English
(3) Reuse potential
The potential to use Glyphis twofold: one one hand applications can be easily written and the elegant core functionality can be extended; on the other hand, researchers can use the code as core for symbolic regression and extend its functionality in a very generic way. With respect to applications, currently two main directions are targeted: modeling using genetic programming based symbolic regression and the control of complex system, where a control law can be found generically, using genetic programming. The detailed examples and tutorial allow usage from beginner to experienced level, i.e. undergraduage research projects to faculty research. The design of Glyphis such that generic interfaces are provided allowing for very flexible extension.
Acknowledgements
We acknowledge very fruitful discussions with respect to applications of MLC with S. Brunton, B. Noack, A. Pikovsky, M. Rosenblum, R. Semaan, and B. Strom and coding gossip with V. Mittal.
Funding statement
This work has been partially supported by the German Science Foundation via SFB 880. MQ was supported by a fellowship within the FITweltweit program of the German Academic Exchange Service (DAAD).
Competing interests
The authors declare that they have no competing interests.