
Figuring out how to reduce memory burden when processing large datasets #143

Open
JacobGlennAyers opened this issue Oct 28, 2022 · 1 comment

@JacobGlennAyers (Contributor)
Currently, it appears that PyHa crashes when generating automated labels for particularly large datasets. I suspect this is because the automated label dataframe grows too large to hold in memory.

Potential fixes (a rough sketch of fixes 1 and 2 is below):

  1. Store floats in a smaller dtype. By default, Pandas uses 8-byte float64; downcasting to float32 (or float16) would halve (or quarter) the memory used per value.
  2. Use the built-in Python csv library or incremental CSV writes: create the individual dataframe for each clip, then append it to a master CSV file. This would shift the burden from memory onto storage.
  3. Look into parallelization with Dask (this may speed things up, but I am skeptical that it addresses the memory problem).
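
A minimal sketch of what fixes 1 and 2 could look like, assuming a per-clip labeling call that returns a Pandas dataframe. `generate_labels_for_clip` and `clip_paths` are hypothetical placeholders, not actual PyHa names:

```python
import os
import pandas as pd

MASTER_CSV = "automated_labels.csv"

def downcast_floats(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast float64 columns to float32 to halve their memory footprint."""
    float_cols = df.select_dtypes(include="float64").columns
    df[float_cols] = df[float_cols].astype("float32")
    return df

def append_to_master(df: pd.DataFrame, path: str = MASTER_CSV) -> None:
    """Append one clip's labels to the master CSV, writing the header only once."""
    write_header = not os.path.exists(path)
    df.to_csv(path, mode="a", header=write_header, index=False)

# clip_paths: assumed list of audio file paths; generate_labels_for_clip: hypothetical
# stand-in for whatever produces the per-clip label dataframe.
for clip_path in clip_paths:
    clip_df = generate_labels_for_clip(clip_path)
    clip_df = downcast_floats(clip_df)
    append_to_master(clip_df)
    del clip_df  # the per-clip frame can be freed once it is on disk
```

Since each clip's dataframe is written to disk and freed immediately, peak memory stays roughly at the size of a single clip's labels rather than the whole dataset's.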
@sprestrelski (Member)

How large are said datasets, and what model was used?
