(1) Overview
\addbibresource

main.bib

Software paper for submission to the Journal of Open Research Software

To complete this template, please replace the blue text with your own. The paper has three main sections: (1) Overview; (2) Availability; (3) Reuse potential.

Please submit the completed paper to: editor.jors@ubiquitypress.com

 

(1) Overview

Title

Glyph: Symbolic Regression Tools

Paper Authors

1. Quade, Markus, markus.quade@ambrosys.de
2. Gout, Julien, julien.gout@ambrosys.de
3. Abel, Markus, markus.abel@ambrosys.de

Paper Author Roles and Affiliations

1. Quade, Markus, lead developer, Ambrosys GmbH, David Gilly Straße 1, Potsdam, Germany
2. Gout, Julien, core contributer, Ambrosys GmbH, David Gilly Straße 1, Potsdam, Germany
3. Abel, Markus, scientific project lead, Ambrosys GmbH, David Gilly Straße 1, Potsdam, Germany

Abstract

We present Glyph - a Python package for genetic programming based symbolic regression. Glyphis designed for usage let by numerical simulations let by real world experiments. For experimentalists, glyph-remote provides a separation of tasks: a ZeroMQ interface splits the genetic programming optimization task from the evaluation of an experimental (or numerical) run. Glyphcan be accessed at https://github.com/Ambrosys/glyph. Domain experts are be able to employ symbolic regression in their experiments with ease, even if they are not expert programmers. The reuse potential is kept high by a generic interface design. Glyph is available on PyPI and Github.

Keywords

Symbolic Regression; Genetic Programming; MLC; Python

Introduction

Symbolic regression [schmidt2009distilling] is an optimization method to find an optimal representation of a function. The method is “symbolic”, because building blocks of the functions, i.e. variables, primitive functions, and operators, are represented symbolically on the computer. Genetic programming (GP) [koza1992genetic] can be implemented to find such a function for system identification [Vladislavleva2013, Quade2016] or fluid dynamical control [gout2016, MLC]. Glyphis an effort to separate optimization method and optimization task allowing domain-experts without special programming skills to employ symbolic regression in their experiments. We adopt this separation of concerns implementing a client-server architecture; a minimal communication protocol eases its use. Throughout this paper “experiment” is meant as a synonym for any symbolic regression task including a lab-experiment, a numerical simulation or data fitting.

Previous work on system identification and reverse engineering of conservation laws was reported in [schmidt2009distilling, schmidt2011automated]. Modern algorithms also include multi objective optimization [Quade2016] and advances like age fitness based genetic programming [Schmidt2011] or epigenetic local search [LaCava2016]. There exist various approaches to the representation of multi IO problems, including stack- or graph-based representations and pointers [LaCava2016, Galvan-Lopez2008].

Implementation and architecture

Glyphis intended as a lightweight framework to build an application finding an optimal system representation given measurement data. The main application is intended as system control, consequently a control law is determined and returned. Glyphis built on the idea of loose coupling such that dependencies can be released if wanted.

Figure 1: Left: A typical closed loop control task is sketched. Given a system , some measurements and a control law and we can control the system by adding the actuation . Right: gp-based symbolic regression finds different candidate control laws. Each candidate solution is given a fitness score which is used to compare different solutions and to advance the search in function space. Figure adapted from [gout2016] with permission.

A typical control application consists of a system and its controller, possibly separated, cf. Fig. 1. Glyphhas three main abstractions to build such an application: i) the assessment, which holds all methods and data structures belonging to the experiment, ii) the GP which is responsible for the system identification, and iii) the application components, which constitute an application.

Building an Application

An application consists of a GP callable, the gp_runner, an assessment callable for input, the assessment_runner, and the application which uses both of these classes and holds all application-relevant details. A command-line application is built by

    assessment_runner = AssessmentRunner(assess_args)
    gp_runner = glyph.application.GPRunner(gp_args)
    app = command_line_application(app_args)

The assessment_runner has one argument, the parallel_factory which implements a map() method, possibly parallel. For an application one needs to implement setup, assign_fitness, and measure: setup is self-explaining, measure is a key method which takes as input a set of measurement functions and combines them into a tuple of callable measures for multiobjective optimization. The measures are used eventually in assign_fitness where the return values are used to assign a fitness to an individual from GP. The interface is freely extensible. A gp_runner forwards the evolutionary iteration. It takes as arguments, gp_args, an individual class, a gp_algorithm, and an assessment_runner. The individual class contains the representation of a function, the individual; it is currently based on deap’s tree-based implementation. The gp_algorithm takes care for the breeding and selection steps, its principles are described in [koza1992genetic].

The application is run in the main function with app.run(). Each of the high-level functions contains a bunch of next-level instructions, and can be built with a minimal assembly of methods.

In the application and gp_runner, the user has freedom to add functionality using the list of callbacks in the arguments, say, to implement other logging or streaming options. This allows for very flexible programming. We constructed the components that way to allow users to specialize for their particular experiments and possibly increase performance or extend the symbolic regression, e.g. by replacing the deap tree-based representation of an individual.

Remote Control

One main objective of glyph is its use in a real experiment. In this case, the GP loop is separated from the experimental loop in a client-server setup using ZeroMQ [Akgul:2013:ZER:2523409], cf. Fig. 2.

Figure 2: Sketch of the implementation of the experiment - GP communication as client-server pattern. Left: single experiment server plus event handler. Right: GP client. Both parts are interfaced using ZeroMQ. As described in Sec. Communication Protocol the GP program performs requests, e.g. the evaluation of a candidate solution. The event handler takes care of these requests and eventually forwards them to the hardware.

Consequently, one should implement the interface to the experiment using the protocol described in Sec. Communication Protocol. Having the implementation of the experiment, the server, one needs to implement the client, i.e. the interface to the gp_runner. In essence this means connecting the correct sockets whith ZeroMQ and ensuring that the gp_runner and the assessment_runner use the corresponding sockets. Then, the main application is assembled as before, now using a RemoteApp for the main application, which in turn uses a gp_runner, which then uses now a RemoteAssessmentRunner. That is it, we can run remotely our GP evaluation from some client and the experiment in place of the experiment.

Communication Protocol

The communication is encoded in json [ecma404]. A message is a json object with two members:

1{
2    "action": "value",
3    "payload": "value",
4}

The possible values are listed in Table 1.

Action name Payload Expected return Value
CONFIG config settings
EXPERIMENT list of expressions list of fitness value(s)
SHUTDOWN
Table 1: Communication protocol.

The config action is performed prior to the evolutionary loop. Entering the loop, discovered solutions will be batched and a experiment action will be requested. You can configure optional caching for re-discovered solutions. This includes persistent caching between different runs. The shutdown action will let the experiment program know that the gp loop is finished and you can safely stop the hardware.

Configuration settings are sent as a json object in key:value form, where the keys contain the option to be set, there is only one mandatory option: the primitive set. To configure the primitive set, the primitive names are passed as content of the key config, whose values specify the corresponding arities, both fields described again as json object.

The experiment action sends a list of expressions, encoded as string in prefix (also: polish) notation [jorke1989arithmetische]. For each expression sent, the experiment returns a fitness tuple.

Additionally, one can define the type of algorithm, error metric, representation, hyperparameters, etc. A comprehensive up to date list can be found at http://glyph.readthedocs.io/en/latest/usr/glyph_remote/.

Application example: control of the chaotic Lorenz System

In the following, we demonstrate the application and use of Glyphby the determination of an unknown optimal control law for a chaotic system. As an example, we study the control of the potentially chaotic Lorenz system. Chaotic systems are very hard to predict and control in practice due to their sensitivity towards small changes in the initial state which may lead to exponential divergence of trajectories. The Lorenz model [Lorenz1963] consists of a system of three ordinary differential equations:

(1)

with two nonlinearities, and . Here , , and make up the system state and , , are parameters: is the Prandtl number, is the Rayleigh number, and b is related to the aspect ratio of the air rolls. For a certain choice of parameters and initial conditions chaotic behavior emerges.

Here we present two examples where the target is to learn control of bring a chaotic Lorenz system to a complete stop, that is, (). In the first example, the actuator term is applied to . This allows for a more direct control of the system, since appears in every equation of (1) and, thus, influence all three state components, , , and . In the second example the actuator term is applied to , which leads to a more indirect control, since the flow of information from to is only through .

population size
max. generations
MOO algorithm NSGA-II
tree generation halfandhalf
min. height 1
max. height 4
selection selTournament
tournament size
breeding varOr
recombination cxOnePoint
crossover probability
crossover max. height
mutation mutUniform
mutation probability
mutation max. height
constant optimization leastsq
Table 2: General setup of the GP runs.
dynamic system GP
cost functionals RMSE
RMSE
RMSE
length
argument set
constant set
seed (in )
seed (in )
Table 3: Control of the Lorenz system: system setup.

The system setup is summarized in Table 2 and Table 3. When , , and , the Lorenz system produces chaotic solutions (not all solutions are chaotic). Almost all initial points will tend to an invariant set – the Lorenz attractor – a strange attractor and a fractal. When plotted the chaotic trajectory of the Lorenz system resembles a butterfly (blue graph in Fig. 3). The target of control is, again, formulated as RMSE of the system state with respect to zero (separately for each component)

The control function can make use of ideal measurements of the state components. Constant optimization is performed on a single constant . The respective GP runs for control in and control in are conducted with the corresponding random seeds labeled “in ” and “in ”.

Control in :

For control in the actuator term is added to the left side of the equation for in the uncontrolled system (1):

The Pareto solutions from the GP run are shown in Table 4. The wide spread of the cost indices is a sign of conflicting objectives that are hard to satisfy in conjunction. Interestingly, almost all solutions, , commonly introduce a negative growth rate into . This effectively drives to zero and suppresses the growth terms, and , in the equations for and respectively, in turn, driving and to zero as well. As would be expected, minimal expressions, of length 1 or 2, cannot compete in terms of the RMSE. For example, the simple solution, (fourth row), is almost as good as the lengthier one, (first row), and even better in RMSE.

RMSE RMSE RMSE length expression constants
Table 4: Control of the Lorentz system in : Pareto-front solutions.

Table 4 shows the results from the GP run. One solution immediately stands out: , with (second row). It is exactly what one might expect as a control term for the chaotic Lorenz system with control in . This control law effectively reduces the Rayleigh number to a value close to zero (), pushing the Lorenz system past the first pitchfork bifurcation, at , back into the stable-origin regime. If then there is only one equilibrium point, which is at the origin. This point corresponds to no convection. All orbits converge to the origin, which is a global attractor, when .

The phase portrait of the solution from the first and second row of Table 4 are illustrated in Fig. 3. After a short excursion in negative direction (), the green trajectory quickly converges to zero. The red trajectory seems to take a shorter path in phase space, but, it is actually slower to converge to the origin. This is verified by a plot of the trajectories for the separate dimensions , and over time Fig. 4.


Figure 3: Phase portrait of the forced Lorenz system with control exerted in . (Green and red: The system trajectories when controlled by two particular Pareto-front solutions. Blue: the uncontrolled chaotic system.)
Figure 4: Detailed view of the single trajectories in , , and dimension. (blue: uncontrolled; green: , ; red: , .)
Control in :

For control in the actuator term is added to the left side of the equation for in the uncontrolled system (1)

Selected Pareto-front individuals from the GP run are displayed in Table 5. As mentioned at the beginning of this section, effective control is hindered by the indirect influence of on the other state variables, hence, it is not surprising that the control laws here are more involved than in the previous case. Also, they generally do not perform well in the control of , which is expressed by the relatively high values in RMSE. This is confirmed by the phase portrait of the solution shown in figure Fig. 5: While going straight to the origin in the -plane there are strong oscillations of the trajectory along the -axis.

The dynamics caused by the actuation, e.g. for the best control law found, can be explained qualitatively: there is a strong damping in all variables but . This reflects the tendency to suppress -oscillations and, at the same time, to add damping in through the term: if grows, the contribution to damping on the right hand side of the Lorenz equations (1) grows and, in turn, damps . This is, however, only possible to some extent, hence, the oscillations observed in figure Fig. 5.

RMSE RMSE RMSE length expression constants
Table 5: Control of the Lorentz system in : selected Pareto-front solutions.

Figure 5: Control of the Lorentz system in .

We conclude the demonstration with a short summary: Using Glyphwe can find complex control laws, even for unknown systems. This cannot be easily achieved with other frameworks. The control laws found can be studied analytically in contrast to several other methods which have black-box character. The usage is straightforward, as we have described above. The above example can be found online as an example.

Other symbolic regression libraries

Due to its popularity, symbolic regression is implemented by most genetic programming libraries. A semi-curated list can be found at http://geneticprogramming.com/software/. In contrast to other implementations, Glyphimplements higher concepts, such as symbolic constant optimization, and also offers parallel execution for complex examples (control simulation, system identification). Glyphis well tested, cf. Table 6 and currently applied in two experiments and several numerical problems. For control, there exists a dedicated matlab toolbox (with python interface), openMLC [machinelearningcontrol_2017], which contains much of the material treated in [MLC].

CI/tests doc open api caching checkpointing MOGP SCO MO
openMLC
Glyph
Table 6: Comparison of Glyphand openMLC features. MOGP refers to multi objective optimization. MO means multiple outputs. SCO means symbolic constant optimization.

Quality control

Continuous Integration tests are conducted for Mac, Linux and Windows using Travis and AppVoyer. The tests consider Python version 3.5 and 3.6 . Unit test coverage is around 85% as reported by codecov. Additionally, tests specifically cover the stochastic parts of the optimization to ensure reproducibility. Along with the software tests are shipped which guarantee the correct execution of the examples. The user can reuse these tests for further development. Locally, tests can be executed via the pytest command.

(2) Availability

Operating system

Glyphis compatible with Mac, Linux and Windows.

Programming language

Python 3.5+

Dependencies

Currently, Glyphis based on DEAP [DEAP_JMLR2012], an evolutionary computation framework adopting a toolbox-like structure for rapid prototyping. Further dependencies are found up-to-date at https://github.com/Ambrosys/glyph/blob/master/requirements.txt.

List of contributors

Core contributors (prior to open source): Markus Quade Julien Gout

Open source contributors can be found at https://github.com/Ambrosys/glyph/graphs/contributors.

Software location:

Archive Zenodo

Name:

Ambrosys/glyph

Persistent identifier:
Licence:

LGPL

Publisher:

Markus Quade

Version published:

0.3.3

Date published:

12.09.17

Code repository Github

Name:

glyph

Persistent identifier:
Licence:

LGPL

Date published:

08.12.16

Language

English

(3) Reuse potential

The potential to use Glyphis twofold: one one hand applications can be easily written and the elegant core functionality can be extended; on the other hand, researchers can use the code as core for symbolic regression and extend its functionality in a very generic way. With respect to applications, currently two main directions are targeted: modeling using genetic programming- based symbolic regression and the control of complex system, where a control law can be found generically, using genetic programming. The detailed examples and tutorial allow usage from beginner to experienced level, i.e. undergraduage research projects to faculty research. The design of Glyphis such that generic interfaces are provided allowing for very flexible extension.

Acknowledgements

We acknowledge very fruitful discussions with respect to applications of MLC with S. Brunton, B. Noack, A. Pikovsky, M. Rosenblum, R. Semaan, and B. Strom and coding gossip with V. Mittal.

Funding statement

This work has been partially supported by the German Science Foundation via SFB 880. MQ was supported by a fellowship within the FITweltweit program of the German Academic Exchange Service (DAAD).

Competing interests

The authors declare that they have no competing interests.

\printbibliography
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
131582
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description