Memory usage increases across multiple parallel_apply #264

Open
hogan-roblox opened this issue Mar 4, 2024 · 2 comments

hogan-roblox commented Mar 4, 2024

General

  • Operating System: Linux
  • Python version: 3.10.8
  • Pandas version: 1.5.3
  • Pandarallel version: 1.6.5

Acknowledgement

  • My issue is NOT present when using pandas alone (without pandarallel)
  • If I am on Windows, I read the Troubleshooting page before writing a new bug report

Bug description

If I run consecutive data-processing tasks, each calling parallel_apply on a huge DataFrame, their memory footprints somehow accumulate.

Observed behavior

My code logic looks like the following:

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True, nb_workers=120)

for file_path in file_paths:
    df = pd.read_csv(file_path)
    # Shuffle, apply SOME_FUNCTION row-wise in parallel, and rebuild the frame.
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(SOME_FUNCTION, axis=1).to_dict(),
        orient="columns",
    )

All tasks should have similar memory footprints. However, as the image below shows, memory drops after the first task finishes but soon climbs back up once the second task loads.
[image: memory usage over time, showing the footprint accumulating across tasks]

Expected behavior

Given that the two tasks have similar memory footprints, I would expect the memory pattern to repeat rather than accumulate.

Minimal but working code sample to ease bug fix for pandarallel team

See the pseudocode attached above; a runnable approximation follows.
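For completeness, here is a runnable approximation of the loop above. The real SOME_FUNCTION and input CSVs are not shareable, so a trivial row transform and a synthetic frame stand in for them; psutil (assumed installed) logs the parent's RSS after each task so the accumulation is visible numerically.

import os

import pandas as pd
import psutil
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True, nb_workers=120)
proc = psutil.Process(os.getpid())

for i in range(5):  # stands in for the loop over file_paths
    df = pd.DataFrame({"a": range(1_000_000)})  # stands in for pd.read_csv(file_path)
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(lambda row: row + 1, axis=1).to_dict(),
        orient="columns",
    )
    # RSS of the parent process; on my runs this climbs across tasks.
    print(f"task {i}: RSS = {proc.memory_info().rss / 2**20:.0f} MiB")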

hogan-roblox (Author) commented Mar 9, 2024

I have an update on this -- it seems that pandarallel.initialize(progress_bar=True, nb_workers=120) has to be re-executed between DataFrames. Is this expected?

The updated code below works around the issue for me.

import pandas as pd
from pandarallel import pandarallel

for file_path in file_paths:
    # Re-initializing before each task keeps memory from accumulating.
    pandarallel.initialize(progress_bar=True, nb_workers=120)
    df = pd.read_csv(file_path)
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(SOME_FUNCTION, axis=1).to_dict(),
        orient="columns",
    )
[image: memory usage over time with per-task re-initialization, showing the footprint no longer accumulating]

This issue is no longer a blocker for me, but I would like to leave it open for a while to see whether someone else hits the same issue and whether this is expected behavior.
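Another possible workaround I have considered but not tried: run each task in a short-lived child process, so the OS reclaims all memory (including anything pandarallel holds onto) when the child exits. A sketch only; SOME_FUNCTION, file_paths, and the output paths are placeholders.

import multiprocessing as mp

import pandas as pd
from pandarallel import pandarallel


def process_one(file_path, out_path):
    # Fresh pandarallel state per task, since this runs in a new process.
    pandarallel.initialize(progress_bar=True, nb_workers=120)
    df = pd.read_csv(file_path)
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(SOME_FUNCTION, axis=1).to_dict(),
        orient="columns",
    )
    # Persist the result; the DataFrame itself dies with this process.
    df.to_csv(out_path, index=False)


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    for file_path in file_paths:
        worker = ctx.Process(target=process_one, args=(file_path, file_path + ".out"))
        worker.start()
        worker.join()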

hogan-roblox changed the title from "Memory usage continuously increases over time" to "Memory usage increases across multiple parallel_apply" on Mar 11, 2024
shermansiu commented Apr 27, 2024

Could you please attach a sample CSV and the simplest SOME_FUNCTION for which you can reproduce your error?

I'm unable to reproduce the memory usage problem.

Python: 3.10.13
Pandarallel: 1.6.5
Pandas: 2.2.0

import pandas as pd
import pandarallel

pandarallel.pandarallel.initialize(progress_bar=True, nb_workers=120)

# Repeat the reported pattern: shuffle, parallel_apply, rebuild the frame.
for _ in range(10):
    df = pd.DataFrame({"foo": range(100_000)})
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(lambda x: x + 1, axis=1).to_dict(),
        orient="columns",
    )
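One data point that would help: whether the growth is in the parent's Python heap or elsewhere. tracemalloc only sees Python allocations in the parent process, so if your RSS climbs while these totals stay flat, the memory is being held by the worker processes or by native allocations. A sketch based on my repro above:

import tracemalloc

import pandas as pd
import pandarallel

pandarallel.pandarallel.initialize(progress_bar=True, nb_workers=120)

tracemalloc.start()
for i in range(10):
    df = pd.DataFrame({"foo": range(100_000)})
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(lambda x: x + 1, axis=1).to_dict(),
        orient="columns",
    )
    # Python-heap usage of this (parent) process only.
    current, peak = tracemalloc.get_traced_memory()
    print(f"iter {i}: python heap current={current / 2**20:.1f} MiB, peak={peak / 2**20:.1f} MiB")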

You mentioned that this issue is no longer a blocker for you, so if you don't reply in a while, this issue should probably be closed.
