Update frame repacking to use new pyarrow data types #2470

Open
larsyencken opened this issue Mar 28, 2024 · 5 comments

@larsyencken
Collaborator

Context

We repack frames to make our data frames smaller on disk and faster to work with. Repacking has some minor drawbacks: for example, it makes more variables categorical, which complicates group-bys.

What

With Pandas 2.2, we have the option to change our frame repacking to use new pyarrow data types, which are supposed to be much more efficient.

That would also bring our data catalog into compatibility with more of the data ecosystem (e.g. Polars, Nushell and friends).

@larsyencken
Collaborator Author

larsyencken commented Mar 28, 2024

Blocked on:

@larsyencken
Collaborator Author

From discussion: we could also consider a gradual migration, e.g. enabling pyarrow types for new datasets rather than applying it over everything.

We could also benchmark the performance gains to see if they're worth it.

@larsyencken
Collaborator Author

Pablo noted that adding commitments for our future selves can cause "small" data updates to expand in scope. We should try to make as much as possible automatic.

stale bot commented Jun 10, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the "wontfix" label on Jun 10, 2024
@Marigold
Collaborator

Don't close it. We should at least compare the performance (CPU & mem) of current repacking vs new pyarrow dtypes.

stale bot removed the "wontfix" label on Jun 11, 2024