Inconsistencies in benchmark results #19

SekouDiaoNlp · 2021-05-14T10:38:39Z

SekouDiaoNlp
May 14, 2021
Collaborator

I have been running the benchmark quite a few times to generate the plot, and I noticed that on this machine (an AWS c5.4xlarge instance with 16 CPUs and 32 GB of RAM, running Ubuntu 20.04) the results differ from those in the README table.

When I run the benchmarks on this machine I get the following results:

{'filename': 'sample.csv', 'repetitions': 10000, 'timer': 'real'}
PANDAS_READ_CSV =       12.118

NUMPY_FROMFILE =        2.0003

NUMPY_LOADTXT = 0.5655

NUMPY_GENFROMTXT =      1.97

CSV =   0.1583

CSV_LIST =      0.1561

CSV_MAP =       0.1631

FASTER_THAN_CSV =       5.7811

I believe faster_than_csv should be the fastest by far.

Has anything changed in the Nim compiler, or in the optimization options of the Nim compiler or the C/C++ compiler, since the last time you ran the benchmark?

I used the results in the README table to generate the graph, but my own benchmarking does not agree with those results.

Can you re-run the benchmark and let me know if you get the same behavior as I do?

I am currently running the benchmark on a much bigger file: 24 MB, 142,695 rows, and 35 columns with mixed float, int, str and bool entries.

I will update this thread with the results of this benchmark.
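For readers who want to compare these numbers programmatically, here is a minimal Python sketch that parses output in the `NAME = value` form shown above and ranks the methods. The helper names `parse_results` and `rank` are illustrative and are not part of the benchmark script; the figures are copied from the run above.

```python
def parse_results(text):
    """Parse 'NAME = value' lines into a dict, skipping anything else."""
    results = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # blank lines and the settings line contain no '='
        try:
            results[name.strip()] = float(value)
        except ValueError:
            continue  # ignore lines whose right-hand side is not a number
    return results


def rank(results):
    """Return (name, time, ratio_to_fastest) tuples, fastest first."""
    fastest = min(results.values())
    ordered = sorted(results.items(), key=lambda item: item[1])
    return [(name, t, t / fastest) for name, t in ordered]


OUTPUT = """
{'filename': 'sample.csv', 'repetitions': 10000, 'timer': 'real'}
PANDAS_READ_CSV =       12.118
NUMPY_FROMFILE =        2.0003
NUMPY_LOADTXT = 0.5655
NUMPY_GENFROMTXT =      1.97
CSV =   0.1583
CSV_LIST =      0.1561
CSV_MAP =       0.1631
FASTER_THAN_CSV =       5.7811
"""

for name, t, ratio in rank(parse_results(OUTPUT)):
    print(f"{name:<18}{t:>9.4f}  {ratio:6.1f}x")
```

On these figures FASTER_THAN_CSV comes out about 37 times slower than the fastest method (CSV_LIST), which is the inconsistency reported in this post.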

SekouDiaoNlp · 2021-05-14T12:12:53Z

SekouDiaoNlp
May 14, 2021
Collaborator Author

I ended up testing with this file (New-Zealand-business-demography-statistics-At-February-2020), which is more representative of the kind of data loaded in data science work.

It is 120 MB in size and has 5,429,253 rows of 5 columns of str and int, and is big enough that startup and compilation time should not matter much.

The benchmark is running right now with 10,000 repetitions for each method.

I will post the results when the benchmark is done.
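As an illustration of what a repetition-based timing run looks like, here is a minimal sketch using only the Python standard library. It is not the project's benchmark script: `time_reader`, `read_with_csv`, and the temporary file are assumptions made for the example. It reports the total wall-clock time over all repetitions, which is one plausible reading of the `repetitions` and `'timer': 'real'` settings printed by the benchmark.

```python
import csv
import os
import tempfile
import time


def read_with_csv(path):
    """Read the whole file into a list of rows with the stdlib csv module."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


def time_reader(reader, path, repetitions):
    """Return total wall-clock seconds spent calling reader(path) repeatedly."""
    start = time.perf_counter()
    for _ in range(repetitions):
        reader(path)
    return time.perf_counter() - start


# Build a small throwaway CSV so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["id", "name", "value"])
    writer.writerows([i, f"row{i}", i * 0.5] for i in range(1000))
    sample_path = tmp.name

total = time_reader(read_with_csv, sample_path, repetitions=100)
print(f"CSV = {total:.4f} s total, {total / 100 * 1000:.3f} ms per read")
os.remove(sample_path)
```

With a large input each call takes long enough that fixed per-call overhead becomes a small share of the total, which is the reasoning behind choosing the 120 MB file.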

7 replies

SekouDiaoNlp May 14, 2021
Collaborator Author

The final results for this 120 MB file (I removed 'numpy_loadtxt' as it was choking on the file):

{'filename': 'sample.csv', 'repetitions': 101, 'timer': 'real'}
PANDAS_READ_CSV =       283.992

NUMPY_FROMFILE =        1980.375

NUMPY_GENFROMTXT =      2108.9164

CSV =   386.3869

CSV_LIST =      360.5219

CSV_MAP =       303.8799

FASTER_THAN_CSV =       365.0886

> What's the command used to compile?

I ran the supplied run-benchmark.sh after building and running the environment defined in the Dockerfile, changing only the number of repetitions to 101.

> What's the GCC version?

root@46c0c2c10088:/tmp# gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

SekouDiaoNlp May 14, 2021
Collaborator Author

> What's the command used to compile?

In the original benchmark I ran the Docker script and the benchmark unmodified, using the file 'sample.csv'.

> What's the GCC version?

root@46c0c2c10088:/tmp# gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

juancarlospaco May 14, 2021
Maintainer

Try again with Docker; that should give the best performance.
That GCC seems old; I have GCC 10.2.0.

SekouDiaoNlp May 14, 2021
Collaborator Author

> Try again with Docker; that should give the best performance.

OK, I will.

> That GCC seems old; I have GCC 10.2.0.

It is the LTS version of Ubuntu; packages tend to be a bit old.

I will try and let you know.

juancarlospaco May 14, 2021
Maintainer

Yeah, I know. I use Artix Linux with Bedrock Linux, so it is like a mix of Arch Linux and Alpine Linux, with Nim 1.5.1.

I think we should remove the map one because it is not doing the same thing as the rest.

SekouDiaoNlp · 2021-05-14T19:35:04Z

SekouDiaoNlp
May 14, 2021
Collaborator Author

I just re-ran the benchmark from the updated Dockerfile with the new compile flags and the same GCC version, 9.3.0.
The results are:

{'filename': 'sample.csv', 'repetitions': 10000, 'timer': 'real'}
PANDAS_READ_CSV =       12.0711

NUMPY_FROMFILE =        1.9925

NUMPY_LOADTXT = 0.5585

NUMPY_GENFROMTXT =      1.9629

CSV =   0.157

CSV_LIST =      0.159

CSV_MAP =       0.1627

FASTER_THAN_CSV =       5.7969

I will retry with GCC 10.2.0.

0 replies

SekouDiaoNlp · 2021-05-14T19:36:41Z

SekouDiaoNlp
May 14, 2021
Collaborator Author

> I think we should remove the map one because it is not doing the same thing as the rest.

I think it is OK to leave it in; I have seen it used many times.

0 replies

juancarlospaco · 2021-05-14T23:15:07Z

juancarlospaco
May 14, 2021
Maintainer

I fixed it. I ran it more than 20 times, and it always takes about half the time of the next fastest, with GCC 10.2 and Nim 1.5.

1 reply

SekouDiaoNlp May 15, 2021
Collaborator Author

Cool, I will retry it today 😃

SekouDiaoNlp · 2021-05-15T08:35:08Z

SekouDiaoNlp
May 15, 2021
Collaborator Author

The results are much better now on the same machine with GCC 9.3.0 and Nim 1.4.2:

{'filename': 'sample.csv', 'repetitions': 1000, 'timer': 'real'}
PANDAS_READ_CSV =       1.2201

NUMPY_FROMFILE =        0.1994

NUMPY_LOADTXT = 0.0569

NUMPY_GENFROMTXT =      0.1938

CSV =   0.0159

CSV_LIST =      0.0158

FASTER_THAN_CSV =       0.0149

I will try again with GCC 10.2 and Nim 1.5, but there is clearly a great improvement already.
As many people are on Ubuntu LTS it's great that it is the fastest again!
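One caveat when comparing this run with the earlier ones: the repetition count changed from 10000 to 1000. The sketch below normalizes both runs to time per repetition, assuming the reported figures are totals over all repetitions (the thread does not state the unit, so that is an assumption). The dictionaries copy a subset of the numbers from the earlier re-run with the updated Dockerfile and from the run above; the helper name `per_repetition` is illustrative.

```python
def per_repetition(totals, repetitions):
    """Convert total times into time per single repetition."""
    return {name: total / repetitions for name, total in totals.items()}


# Earlier re-run: 10000 repetitions, before the fix.
before = per_repetition(
    {
        "PANDAS_READ_CSV": 12.0711,
        "NUMPY_LOADTXT": 0.5585,
        "CSV": 0.157,
        "CSV_LIST": 0.159,
        "FASTER_THAN_CSV": 5.7969,
    },
    repetitions=10000,
)

# Run above: 1000 repetitions, after the fix.
after = per_repetition(
    {
        "PANDAS_READ_CSV": 1.2201,
        "NUMPY_LOADTXT": 0.0569,
        "CSV": 0.0159,
        "CSV_LIST": 0.0158,
        "FASTER_THAN_CSV": 0.0149,
    },
    repetitions=1000,
)

for name in before:
    speedup = before[name] / after[name]
    print(f"{name:<18} {speedup:6.2f}x faster per repetition")
```

Under that assumption the other methods are essentially unchanged per repetition, while FASTER_THAN_CSV improves by roughly 39x, so the gain comes from the fix rather than from the lower repetition count.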

0 replies

SekouDiaoNlp · 2021-05-15T09:51:49Z

SekouDiaoNlp
May 15, 2021
Collaborator Author

I re-ran the benchmark on the same system but with a tweaked docker image to force the installation of the latest versions of GCC and Nim.

root@56056555239b:/tmp# gcc --version
gcc (Ubuntu 10.3.0-1ubuntu1~20.04~2) 10.3.0

root@56056555239b:/tmp# nim --version
Nim Compiler Version 1.5.1 [Linux: amd64]
Compiled at 2021-05-15
Copyright (c) 2006-2021 by Andreas Rumpf

root@56056555239b:/tmp# ./run-benchmark.sh
{'filename': 'sample.csv', 'repetitions': 1000, 'timer': 'real'}
PANDAS_READ_CSV =       1.0157

NUMPY_FROMFILE =        0.1679

NUMPY_LOADTXT = 0.0473

NUMPY_GENFROMTXT =      0.1908

CSV =   0.0136

CSV_LIST =      0.0137

FASTER_THAN_CSV =       0.0067

Now with the latest compiler versions it is twice as fast as the second fastest 👍🏿 🥇

1 reply

juancarlospaco May 15, 2021
Maintainer

I think this is great because those libraries have decades of development, an army of developers, a ton of lines of code, companies behind them, sponsors, etc.

...and we can beat them. 🥳

SekouDiaoNlp · 2021-05-15T10:07:34Z

SekouDiaoNlp
May 15, 2021
Collaborator Author

I had a look at the optimization you made.

Basically you discarded the columns parameter and created a new array to hold the columns for each row.

Did the compiler flags you changed a bit earlier have any effect? Or is the bulk of the improvements from refactoring faster_than_csv.nim to be more efficient?

I guess there has been a change in the behavior of the Nim compiler since you first ran the benchmark.

3 replies

juancarlospaco May 15, 2021
Maintainer

Nah, it was my error.

juancarlospaco May 15, 2021
Maintainer

Nim typically gets faster and faster with each release and does not break too much; it marks features as Deprecated or Experimental.

Having compile-time memory management like Rust kinda helps too.

SekouDiaoNlp May 15, 2021
Collaborator Author

Cool, now it really is faster.

I re-ran the benchmark with the 120 MB file with the latest compilers and here are the results:

root@56056555239b:/tmp# ./run-benchmark.sh
{'filename': 'sample.csv', 'repetitions': 101, 'timer': 'real'}
PANDAS_READ_CSV =       241.0715

NUMPY_FROMFILE =        2312.7144

NUMPY_GENFROMTXT =      2318.1822

CSV =   365.1048

CSV_LIST =      338.7259

FASTER_THAN_CSV =       198.6154

Pandas gets more efficient as the file gets bigger and takes second place, while faster_than_csv is still the fastest 👍🏿
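To put a number on that observation, this sketch computes how much slower each method is than faster_than_csv on each file, using the figures from the two runs with the latest compilers. Ratios within a single run do not depend on the repetition count; `slowdown` is a hypothetical helper name.

```python
def slowdown(results, method, baseline="FASTER_THAN_CSV"):
    """Return how many times slower `method` is than `baseline` in one run."""
    return results[method] / results[baseline]


# Small sample.csv run (1000 repetitions, GCC 10.3.0, Nim 1.5.1).
small_file = {
    "PANDAS_READ_CSV": 1.0157,
    "CSV": 0.0136,
    "CSV_LIST": 0.0137,
    "FASTER_THAN_CSV": 0.0067,
}

# 120 MB file run (101 repetitions, same compilers).
large_file = {
    "PANDAS_READ_CSV": 241.0715,
    "CSV": 365.1048,
    "CSV_LIST": 338.7259,
    "FASTER_THAN_CSV": 198.6154,
}

for label, results in (("small file", small_file), ("120 MB file", large_file)):
    for method in ("PANDAS_READ_CSV", "CSV", "CSV_LIST"):
        ratio = slowdown(results, method)
        print(f"{label:<12} {method:<16} {ratio:7.2f}x slower")
```

On the small file pandas is about 150 times slower than faster_than_csv; on the 120 MB file the gap shrinks to roughly 1.2x, which puts pandas ahead of the two csv-module variants.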

SekouDiaoNlp · 2021-05-15T16:13:40Z

SekouDiaoNlp
May 15, 2021
Collaborator Author

Btw @juancarlospaco, you should add a description for the PyPI page.

5 replies

juancarlospaco May 15, 2021
Maintainer

I don't know how to do it; there's no option for it on the web.
If it is done from the code and you know how to do it, feel free to make a PR.

SekouDiaoNlp May 15, 2021
Collaborator Author

OK, I will look into it.
It seems that modifying /dist/PKG-INFO should do the trick.

juancarlospaco May 15, 2021
Maintainer

But isn't it better to just link to the README?
Otherwise we have to maintain a duplicate of the README on PyPI too...

Just add a link to the README, in my opinion.
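For reference, PyPI renders the project description from the `long_description` metadata field, which setuptools accepts as a `setup()` keyword. A minimal sketch of the link-only approach suggested here follows; the URL is a placeholder, `build_description_kwargs` is a hypothetical helper, and this is not necessarily what the project's packaging does.

```python
def build_description_kwargs(readme_url):
    """Return setup() keyword arguments for a short, link-only PyPI description."""
    return {
        "long_description": f"Documentation and benchmarks: see the README at {readme_url}",
        "long_description_content_type": "text/plain",
    }


# Placeholder URL (an assumption); replace with the real repository README link.
kwargs = build_description_kwargs("https://example.com/your-project/README.md")
print(kwargs["long_description"])

# In setup.py these would be passed on, for example:
#     from setuptools import setup
#     setup(name="your-package", version="0.0.0", **kwargs)
```

Keeping the description to a single link avoids maintaining two copies of the README, at the cost of a sparse PyPI page.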

SekouDiaoNlp May 15, 2021
Collaborator Author

Done. See #21.

juancarlospaco May 15, 2021
Maintainer

Merged.

juancarlospaco · 2021-05-16T16:52:05Z

juancarlospaco
May 16, 2021
Maintainer

This works.

0 replies
