
consider taking in global environment variable for nb_workers and possibly other parameters too #257

SiRumCz opened this issue Nov 14, 2023 · 3 comments



SiRumCz commented Nov 14, 2023

Please write here what feature pandarallel is missing: I would like to control the number of workers being spawned without touching the code when running on different machines.

nalepae (Owner) commented Jan 23, 2024

Pandaral·lel is looking for a maintainer!
If you are interested, please open a GitHub issue.

shermansiu commented

I'd prefer to keep the code in the core package minimal to make things easier to maintain.

Wouldn't you be able to achieve the same functionality by reading the environment variables with os.environ and passing them to pandarallel.initialize?

IMO this is a wontfix issue, unless a compelling reason is given.
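For illustration, a minimal sketch of that suggestion, assuming the caller picks their own environment variable (PANDARALLEL_NB_WORKERS here is a hypothetical name, not something pandarallel reads itself):

```python
import os

import pandas as pd
from pandarallel import pandarallel

# Hypothetical environment variable; unset means "use pandarallel's default".
nb_workers = os.environ.get("PANDARALLEL_NB_WORKERS")

if nb_workers is not None:
    pandarallel.initialize(nb_workers=int(nb_workers))
else:
    # Default: pandarallel picks the number of workers from the available cores.
    pandarallel.initialize()

df = pd.DataFrame({"x": range(10)})
df["y"] = df.parallel_apply(lambda row: row["x"] ** 2, axis=1)
```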


IceFreez3r commented May 13, 2024

One of the tools I use, which relies on pandarallel, consistently fails with out-of-memory errors in a cluster environment.
According to vladr on Stack Overflow:

Memory-wise, we already know that subprocess.Popen uses fork/clone under the hood, meaning that every time you call it you're requesting once more as much memory as Python is already eating up, i.e. in the hundreds of additional MB, all in order to then exec a puny 10kB executable such as free or ps. In the case of an unfavourable overcommit policy, you'll soon see ENOMEM.

This wouldn't be a problem in the general case, but overcommitting memory is disabled on the cluster. Since the cluster has a lot of cores, this easily eats up the entire RAM, even for processes that would be fine with 10 GB of memory.
I've run the tool with the exact same commands on a machine with only a few cores and overcommitting enabled, and it worked fine.

If I could just limit the number of workers/subprocesses this problem wouldn't occur.

Edit: Also note that I cannot just edit the call to pandarallel.initialize, since I'm using someone else's code.
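For reference, a minimal sketch of one possible workaround in that situation, assuming the third-party tool is started from a Python entry point you control, calls pandarallel.initialize only after your code has run, and passes its options as keyword arguments; the environment variable name and the tool's import are hypothetical:

```python
import os

from pandarallel import pandarallel

# Keep a reference to the real initialize before replacing it.
_original_initialize = pandarallel.initialize

def _capped_initialize(*args, **kwargs):
    # PANDARALLEL_NB_WORKERS is a hypothetical variable name chosen by the user.
    nb_workers = os.environ.get("PANDARALLEL_NB_WORKERS")
    if nb_workers is not None:
        kwargs["nb_workers"] = int(nb_workers)
    return _original_initialize(*args, **kwargs)

# Monkeypatch initialize before the third-party tool gets a chance to call it.
pandarallel.initialize = _capped_initialize

# Only afterwards import and run the tool (hypothetical entry point):
# from some_tool import main
# main()
```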
