Update frame repacking to use new pyarrow data types #2470

larsyencken · 2024-03-28T10:50:03Z

Context

We repack data frames to make them smaller on disk and faster to work with. Repacking has some minor annoyances: for example, it makes more variables categorical, which complicates group-bys.
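To illustrate the group-by annoyance, here is a minimal sketch. It does not use our actual repacking code; the `astype("category")` call is only a stand-in for what repacking does to string columns, and the column names are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["France", "France", "Kenya"],
    "value": [1.0, 2.0, 3.0],
})

# Stand-in for repacking: turn the string column into a categorical.
repacked = df.astype({"country": "category"})

# Filtering rows does not drop the now-unused category "Kenya".
subset = repacked[repacked["country"] == "France"]

# observed=False emits one row per category, including unused ones.
all_groups = subset.groupby("country", observed=False)["value"].sum()

# observed=True returns only the categories present in the data.
seen_groups = subset.groupby("country", observed=True)["value"].sum()

print(all_groups)
print(seen_groups)
```

Callers have to remember to pass `observed=True` (or drop unused categories) whenever a repacked column is used as a group key.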

What

With pandas 2.2, we have the option to change our frame repacking to use the new pyarrow-backed data types, which are reported to be considerably more efficient.

That would also bring our data catalog into compatibility with more of the data ecosystem (e.g. Polars, Nushell and friends).

larsyencken · 2024-03-28T10:50:17Z

Blocked on:

Upgrade pandas to 2.2.x #1094

larsyencken · 2024-04-11T09:21:49Z

From discussion: we could also consider a gradual migration, e.g. enabling pyarrow types for new datasets rather than applying it over everything.

We could also benchmark the performance gains to see if they're worth it.

larsyencken · 2024-04-11T09:26:30Z

Pablo noted that adding commitments to our future selves can mean that "small" data updates grow in scope. We should try to make as many things as possible automatic.

stale · 2024-06-10T23:32:45Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Marigold · 2024-06-11T05:14:10Z

Don't close it. We should at least compare the performance (CPU & mem) of current repacking vs new pyarrow dtypes.

github-actions bot added the needs triage label Mar 28, 2024

larsyencken added priority 3 - nice to have and removed needs triage labels Apr 11, 2024

stale bot added the wontfix label Jun 10, 2024

stale bot removed the wontfix label Jun 11, 2024

larsyencken added the pinned label Jun 11, 2024
