Parallelize encoding of a single row #546

selitvin · 2020-04-20T05:26:02Z

When data is written into a petastorm dataset, before a pyspark sql.Row
object is created, fields containing data that is not natively supported
by the Parquet format, such as numpy arrays, are serialized into byte
arrays. Images may be compressed using png or jpeg compression.

Serializing fields on a thread pool speeds up this process in some
cases (e.g. when a row contains multiple images).

This PR adds a pool executor argument to `dict_to_spark_row`, enabling the user to pass a pool executor that is used to parallelize this serialization. If no pool executor is specified, the encoding/serialization is performed on the caller thread.
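To illustrate the idea, here is a minimal sketch of per-field encoding with an optional executor. It is not the petastorm implementation: `encode_field` and `encode_row` are hypothetical names, `zlib` stands in for the real png/jpeg/numpy codecs, and the `pool_executor` parameter name is an assumption rather than the confirmed `dict_to_spark_row` signature.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def encode_field(value):
    """Hypothetical stand-in for a codec: compress raw bytes, pass other values through."""
    if isinstance(value, (bytes, bytearray)):
        return zlib.compress(bytes(value))
    return value


def encode_row(row_dict, pool_executor=None):
    """Encode every field of a row, optionally on a caller-supplied executor.

    The parameter name `pool_executor` is an assumption made for this sketch.
    """
    if pool_executor is None:
        # No executor given: encode all fields on the caller thread.
        return {name: encode_field(value) for name, value in row_dict.items()}
    # Submit one task per field, then collect the results by field name.
    futures = {name: pool_executor.submit(encode_field, value)
               for name, value in row_dict.items()}
    return {name: future.result() for name, future in futures.items()}


row = {"id": 7, "image_a": b"\x00" * 1024, "image_b": b"\x01" * 1024}

# Serial path: no executor passed.
serial = encode_row(row)

# Parallel path: the caller owns the executor and its lifetime.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = encode_row(row, pool_executor=pool)
```

Letting the caller own the executor means one pool can be reused across many rows instead of being created per call, and the serial path stays the default when nothing is passed.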

codecov · 2020-04-20T07:15:16Z

Codecov Report

Base: 82.88% // Head: 85.99% // Increases project coverage by +3.11% 🎉

Coverage data is based on head (3fe68d4) compared to base (83a02df).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #546      +/-   ##
==========================================
+ Coverage   82.88%   85.99%   +3.11%     
==========================================
  Files          85       87       +2     
  Lines        4721     4935     +214     
  Branches      744      783      +39     
==========================================
+ Hits         3913     4244     +331     
+ Misses        678      568     -110     
+ Partials      130      123       -7

Impacted Files	Coverage Δ
petastorm/unischema.py	`96.91% <100.00%> (+1.12%)`	⬆️
petastorm/reader_impl/pytorch_shuffling_buffer.py	`96.42% <0.00%> (ø)`
petastorm/benchmark/dummy_reader.py	`0.00% <0.00%> (ø)`
petastorm/py_dict_reader_worker.py	`95.23% <0.00%> (+0.79%)`	⬆️
petastorm/spark/spark_dataset_converter.py	`91.76% <0.00%> (+1.49%)`	⬆️
petastorm/pytorch.py	`94.21% <0.00%> (+1.53%)`	⬆️
petastorm/arrow_reader_worker.py	`92.00% <0.00%> (+2.00%)`	⬆️
petastorm/compat.py	`100.00% <0.00%> (+39.02%)`	⬆️
..._dataset_converter/tests/test_converter_example.py	`100.00% <0.00%> (+46.66%)`	⬆️
examples/spark_dataset_converter/utils.py	`100.00% <0.00%> (+62.50%)`	⬆️
... and 2 more


CLAassistant · 2023-02-16T07:46:47Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Yevgeni Litvin seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

selitvin marked this pull request as draft on April 20, 2020 05:28

selitvin force-pushed the parallel_encoding branch from 367736c to 5fe4e11 on April 20, 2020 05:38

selitvin force-pushed the parallel_encoding branch from 5fe4e11 to 3fe68d4 on April 20, 2020 06:56
