Store tables as PARQUET files #419
Open
hagenw wants to merge 62 commits into main from pyarrow
+895 −132
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Closes #382
Closes #376
Several improvements/speed-ups to the table handling by using `pyarrow` (this adds `pyarrow` as a requirement and bumps `pandas` to `>=2.2.0`).
I decided to stay with `"csv"` as the default setting for the `storage_format` argument in `audformat.Database.save()` and `audformat.Table.save()`. This way, we can make a new release of `audformat` and test storing tables as PARQUET files with some databases, without exposing the new format to all users yet. In addition, `audb` needs to be updated to support publishing PARQUET tables.
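For illustration, opting in would then look like this (a sketch; the exact `"parquet"` value name is an assumption based on this PR):

```python
# Sketch: saving a database with the default and the new storage format.
import audformat

db = audformat.Database("mydb")  # hypothetical, empty database
db.save("./mydb", storage_format="csv")      # default, unchanged behavior
db.save("./mydb", storage_format="parquet")  # opt-in PARQUET storage
```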
Loading benchmark

We can benchmark the behavior by loading a dataset from a folder that contains all tables with `audformat.Database.load(db_root, load_data=True)`. Results are given as the average over 10 runs.

Benchmark code
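The collapsed "Benchmark code" is not reproduced here; a minimal version of such a benchmark could look like:

```python
# Hypothetical loading benchmark: average wall-clock time over 10 runs.
import time

import audformat

db_root = "./database"  # assumption: local folder containing all tables
times = []
for _ in range(10):
    t0 = time.time()
    audformat.Database.load(db_root, load_data=True)
    times.append(time.time() - t0)
print(f"Average loading time: {sum(times) / len(times):.2f} s")
```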
The benchmark highlights two important results:
Memory benchmark
I investigated memory consumption using heaptrack when loading the `phonetic-transcription.train-clean-360` table from the `librispeech` dataset from our internal server. Stored as a CSV file the table has a size of 1.3 GB, stored as a PARQUET file 49 MB.

Benchmark code
csv-table-loading.py
The memory consumption is then profiled with:
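The exact command is in the collapsed block above; a typical heaptrack invocation looks like this (the output file name pattern is an assumption, it varies with the heaptrack version):

```
# heaptrack wraps the profiled command and records all allocations
heaptrack python csv-table-loading.py
# inspect the recorded allocations afterwards
heaptrack --analyze "heaptrack.python.<pid>.zst"
```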
The execution time was measured without `heaptrack`:

Why and when is reading from a CSV file slow?
By far the slowest part when reading a CSV file with `pyarrow` is the conversion to `pandas.Timedelta` values for columns that specify a duration. E.g. when reading the CSV from the memory benchmark, the reading with `pyarrow` and the conversion to a `pandas.DataFrame` take 3 s, whereas the conversion of the `start` and `end` columns to `pandas.Timedelta` takes roughly 40 s.
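A sketch of this slow path (the file name is hypothetical; timings are the numbers quoted above):

```python
# Read the CSV with pyarrow and convert to pandas (fast, ~3 s),
# then convert the duration columns to pandas.Timedelta (slow, ~40 s).
import pandas as pd
import pyarrow.csv

df = pyarrow.csv.read_csv("db.table.csv").to_pandas()
df["start"] = pd.to_timedelta(df["start"])
df["end"] = pd.to_timedelta(df["end"])
```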
There is a dedicated dtype with `pyarrow.duration`, but it does not yet have reading support for CSV files. When trying so, you get:
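A minimal way to trigger this (the exact error output is in the collapsed block above):

```python
# Hypothetical reproduction: request duration[ns] as a CSV column type,
# which pyarrow's CSV reader cannot convert to yet; at the time of
# writing this raises pyarrow.lib.ArrowNotImplementedError.
import io

import pyarrow as pa
import pyarrow.csv

pyarrow.csv.read_csv(
    io.BytesIO(b"start\n0 days 00:00:01\n"),
    convert_options=pyarrow.csv.ConvertOptions(
        column_types={"start": pa.duration("ns")},
    ),
)
```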
When storing a table as a PARQUET file we use `duration[ns]` for time values, and converting those back to `pandas.Timedelta` seems to be much faster: reading the PARQUET file needs 0.3 s, converting to the dataframe then takes the remaining 3.4 s.

The conversion can very likely be sped up by switching to `pyarrow`-based dtypes in the dataframe, as we do for `audb.Dependencies._df`, but at the moment this is not fully supported in `pandas`, e.g. timedelta values are not implemented yet (pandas-dev/pandas#52284).

Hashing of PARQUET files
As we have already experienced at audeering/audb#372, PARQUET files are not stored in a reproducible fashion and might return different MD5 hash sums, even though they store the same dataframe. To overcome this problem, I now calculate a hash based on the content of the dataframe using `audformat.utils.hash()` and store the resulting value inside the metadata of the schema of the PARQUET file. This means in `audb` we can access it by just loading the schema of the PARQUET file. The corresponding code to access the hash is:
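A sketch of that access (the `b"hash"` metadata key and the file name are assumptions):

```python
# Read only the schema of the PARQUET file (no table data is loaded)
# and extract the dataframe hash stored in its metadata.
import pyarrow.parquet as parquet

schema = parquet.read_schema("db.table.parquet")
table_hash = schema.metadata[b"hash"].decode()
```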
This approach is faster than calculating the MD5 sum with `audeer.md5()`. Execution time was benchmarked as the average over 100 repetitions:
Benchmark code
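A minimal version of such a comparison might look like this (the file name is hypothetical):

```python
# Hypothetical hashing benchmark: MD5 of the whole file vs. reading
# the pre-computed hash from the PARQUET schema metadata.
import timeit

import audeer
import pyarrow.parquet as parquet

path = "db.table.parquet"
t_md5 = timeit.timeit(lambda: audeer.md5(path), number=100) / 100
t_meta = timeit.timeit(
    lambda: parquet.read_schema(path).metadata[b"hash"],
    number=100,
) / 100
print(f"audeer.md5(): {t_md5:.6f} s, schema metadata: {t_meta:.6f} s")
```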
The downside is that `audformat.utils.hash()` requires `pandas >=2.2.0`, as `pandas` has changed its hash calculation, which means older versions of `pandas` return a different hash, see pandas-dev/pandas#58999. I tried to implement a custom hashing to not rely on `pandas` (as they might change the behavior again in the future), but I did not succeed in implementing something that is fast and does not require much memory.

Writing benchmark
Comparison of saving the `phonetic-transcription.train-clean-360` table from `librispeech` in different formats. Note, saving as a PARQUET file includes calculating the hash of the underlying dataframe.

Benchmark code
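The collapsed code is not reproduced here; a minimal writing benchmark could look like this (paths and the `"parquet"` value are assumptions):

```python
# Hypothetical writing benchmark: save the same database in both
# storage formats and report the wall-clock time for each.
import time

import audformat

db = audformat.Database.load("./librispeech")  # assumption: local copy
for storage_format in ("csv", "parquet"):
    t0 = time.time()
    db.save(f"./out-{storage_format}", storage_format=storage_format)
    print(f"{storage_format}: {time.time() - t0:.1f} s")
```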