Memory usage increases across multiple parallel_apply #264

Open
hogan-roblox opened this issue Mar 4, 2024 · 2 comments

hogan-roblox commented Mar 4, 2024

General

  • Operating System: Linux
  • Python version: 3.10.8
  • Pandas version: 1.5.3
  • Pandarallel version: 1.6.5

Acknowledgement

  • My issue is NOT present when using pandas alone (without pandarallel)
  • If I am on Windows, I read the Troubleshooting page before writing a new bug report

Bug description

If I run consecutive data-processing tasks, each calling parallel_apply on a huge DataFrame, their memory footprints somehow accumulate.

Observed behavior

My code logic looks like the following:

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True, nb_workers=120)

for file_path in file_paths:
    df = pd.read_csv(file_path)
    # Shuffle, apply SOME_FUNCTION row-wise in parallel, and rebuild the frame.
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(SOME_FUNCTION, axis=1).to_dict(),
        orient="columns",
    )

All tasks should have similar memory footprints. However, as the image below shows, memory drops after the first task finishes but soon climbs back up once the second task loads.
[image: memory usage over time, showing the footprint accumulating across tasks]

Expected behavior

Given that the two tasks have similar memory footprints, I would expect the memory pattern to repeat rather than accumulate.

Minimal but working code sample to ease bug fix for pandarallel team

See the pseudocode attached above; a runnable approximation follows.
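For completeness, here is a runnable approximation of the loop above. The real SOME_FUNCTION and input CSVs are not shareable, so a trivial row transform and a synthetic frame stand in for them; psutil (assumed installed) logs the parent's RSS after each task so the accumulation is visible numerically.

import os

import pandas as pd
import psutil
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True, nb_workers=120)
proc = psutil.Process(os.getpid())

for i in range(5):  # stands in for the loop over file_paths
    df = pd.DataFrame({"a": range(1_000_000)})  # stands in for pd.read_csv(file_path)
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(lambda row: row + 1, axis=1).to_dict(),
        orient="columns",
    )
    # RSS of the parent process; on my runs this climbs across tasks.
    print(f"task {i}: RSS = {proc.memory_info().rss / 2**20:.0f} MiB")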

hogan-roblox (Author) commented Mar 9, 2024

I have an update on this -- it seems that pandarallel.initialize(progress_bar=True, nb_workers=120) has to be re-executed between DataFrames. Is this expected?

The updated code below works around the issue for me.

import pandas as pd
from pandarallel import pandarallel

for file_path in file_paths:
    # Re-initializing before each task keeps memory from accumulating.
    pandarallel.initialize(progress_bar=True, nb_workers=120)
    df = pd.read_csv(file_path)
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(SOME_FUNCTION, axis=1).to_dict(),
        orient="columns",
    )
[image: memory usage over time with per-task re-initialization, showing the footprint no longer accumulating]

This issue is no longer a blocker for me, but I would like to leave it open for a while to see whether someone else hits the same issue and whether this is expected behavior.
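Another possible workaround I have considered but not tried: run each task in a short-lived child process, so the OS reclaims all memory (including anything pandarallel holds onto) when the child exits. A sketch only; SOME_FUNCTION, file_paths, and the output paths are placeholders.

import multiprocessing as mp

import pandas as pd
from pandarallel import pandarallel


def process_one(file_path, out_path):
    # Fresh pandarallel state per task, since this runs in a new process.
    pandarallel.initialize(progress_bar=True, nb_workers=120)
    df = pd.read_csv(file_path)
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(SOME_FUNCTION, axis=1).to_dict(),
        orient="columns",
    )
    # Persist the result; the DataFrame itself dies with this process.
    df.to_csv(out_path, index=False)


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    for file_path in file_paths:
        worker = ctx.Process(target=process_one, args=(file_path, file_path + ".out"))
        worker.start()
        worker.join()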

hogan-roblox changed the title from "Memory usage continuously increases over time" to "Memory usage increases across multiple parallel_apply" on Mar 11, 2024
shermansiu commented Apr 27, 2024

Could you please attach a sample CSV and the simplest SOME_FUNCTION for which you can reproduce your error?

I'm unable to reproduce the memory usage problem.

Python: 3.10.13
Pandarallel: 1.6.5
Pandas: 2.2.0

import pandas as pd
import pandarallel

pandarallel.pandarallel.initialize(progress_bar=True, nb_workers=120)

# Repeat the reported pattern: shuffle, parallel_apply, rebuild the frame.
for _ in range(10):
    df = pd.DataFrame({"foo": range(100_000)})
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(lambda x: x + 1, axis=1).to_dict(),
        orient="columns",
    )
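One data point that would help: whether the growth is in the parent's Python heap or elsewhere. tracemalloc only sees Python allocations in the parent process, so if your RSS climbs while these totals stay flat, the memory is being held by the worker processes or by native allocations. A sketch based on my repro above:

import tracemalloc

import pandas as pd
import pandarallel

pandarallel.pandarallel.initialize(progress_bar=True, nb_workers=120)

tracemalloc.start()
for i in range(10):
    df = pd.DataFrame({"foo": range(100_000)})
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(lambda x: x + 1, axis=1).to_dict(),
        orient="columns",
    )
    # Python-heap usage of this (parent) process only.
    current, peak = tracemalloc.get_traced_memory()
    print(f"iter {i}: python heap current={current / 2**20:.1f} MiB, peak={peak / 2**20:.1f} MiB")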

You mentioned that this issue is no longer a blocker for you, so if you don't reply in a while, this issue should probably be closed.
