README

About

CSRM (Convergence of Symbolic Regression using Metaheuristics) is an incremental distributed Symbolic Regression tool written in Python. Unlike most optimization tools it will provide you with an extensive amount of data (visual and logged) of how it evolved its solution.

The remainder of this document gets you up and running in the shortest amount of time.

This project fulfilled the requirement for my Master Thesis in Computer Science at the University of Antwerp, Belgium, under the guidance of Prof. Dr. Jan Broeckhove.

For the thesis text please see :

./thesis/thesis/MsCThesisBenCardoen.pdf

Getting the code

$git clone https://bitbucket.org/bcardoen/csrm

Dependencies

Python3
Sortedcontainers :
```
$pip3 install sortedcontainers
```
Numpy :
```
$pip3 install numpy
```
Scipy :
```
$pip3 install scipy
```
Bokeh :
```
$pip3 install bokeh
```
MPI and mpi4py: MPI is the C/C++ framework, mpi4py are the required python bindings. The following links will get you started:

https://www.mpich.org/downloads

https://pypi.python.org/pypi/mpi4py

Packages are available for Fedora/Debian/MacOS and Windows Make sure you install the development version as well as the main version.

Fedora
```
$sudo dnf install mpich-devel mpich python3-mpi4py-mpich
```
If mpiexec isn't found, add it to your path like so:
```
export PATH=/usr/lib64/mpich/bin/:$PATH
```
The MPICH path can obviously be different on your system.

On Ubuntu a typical install would be
```
$sudo apt-get mpich python3-mpi4py
```
Optional : graphviz

Tests

You can execute the tests from the 'src' folder by executing:

    $python3 -m expression.testtree
    $python3 -m gp.testalgorithm

For most of the tests trees are generated in dot format in the 'output' folder. A small script is present in this folder that call graphviz to render the trees in svg format, run from within the output folder:

   $./dotify.sh

Source code documentation

Documentation is present in the folder 'doc/html/index/html'

With all that out of the way, let's get started with running the code

Hello World in Symbolic regression

In a typical use case you have a set of n features, each with k datapoints forming a X = n x k matrix.

You also have a set of expected data, Y, an n x 1 vector.

You want CSRM to find a mathematical expression (non linear) f(X) = Y' s.t. the distance between Y and Y' is minimal.

Input Data format

You'll need to encode X and Y in a format that CSRM understands.

CSRM expects a simple CSV textfile, where ',' is the delimiter, and each line is interpreted as k datapoints for a single feature. Values should be in a floating point format (integers are obviously fine).

Output

CSRM produces a lot of output, depending on your preferences. We distinguish between convergence characteristics (i.e. rate of convergence, accuracy, depth, and more..) and the actual results.

CSRM not only offers you the results, but also how it came by them, so don't expect a black box. You can obviously ignore all that data and simply focus on the results.

Output format

Output is written to a local folder ./output// configstring is a concatenation of all configuration options you used in starting the program. CSRM writes a JSON and HTML file per process. In addition a summarized file (in JSON and HTML) is written that summarizes the all results. Finally, the results of the search, the expressions you wanted to obtain, are written in a text file (bestresults.txt) and a HTML page (bestresults.html).

Running

I will give you two copy-pasteable commands that 'just work'. The first is a sequential run, the second a distributed run.

Sequential

$python3 -m gp.paralleldriver -x doe/input.csv10 -q 3 -y doe/output.csv10 -d 10 -c 1 -f 20 -p 20 -g 20 -v

-x is the input file (3 x 10)

-q specifies the number of features CSRM should expect (3)

-y is the expected data file (1 x 10)

-d is the number of datapoints (10)

-f is the number of phases (restarts reusing the best expressions from the last run)

-p is the population

-g is the number of generations

-c is the number of processes, 1 in this case.

-v instructs CSRM to open all HTML output in your OS's preferred web browser.

Initially you'll be most interested in the 'bestresults.html' page, which lists the best expressions ordered by best fitness score to worst, gives a some statistics about the distribution and lists the configuration you used.

Distributed

$mpiexec -n 3 python3 -m gp.paralleldriver -x doe/input.csv10 -q 3 -y doe/output.csv10 -d 10 -c 3 -f 20 -p 20 -g 20 -v -t tree

Note that we use MPI now to prefix the command string, and specify both the number of processes and the topology

-n Has to match -c

-t The topology to use. Use either tree, grid, none or random.

The results are again written to file and displayed in a browser of your choice.

More

-h will give you all parameters and their meaning, with even more information in the documentation.

Name		Name	Last commit message	Last commit date
Latest commit History 495 Commits
demo		demo
doc		doc
notes		notes
plots		plots
results		results
src		src
testfiles		testfiles
thesis		thesis
.gitignore		.gitignore
README.md		README.md
benchmarks.txt		benchmarks.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

README

About

Getting the code

Dependencies

Tests

Source code documentation

Hello World in Symbolic regression

Input Data format

Output

Output format

Running

More

About

Releases

Packages

Contributors 2

bencardoen/CSRM

Folders and files

Latest commit

History

Repository files navigation

README

About

Getting the code

Dependencies

Tests

Source code documentation

Hello World in Symbolic regression

Input Data format

Output

Output format

Running

More

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Packages