
Python 3.12.1 with pandarallel==1.6.5: parallel_apply time increased ~3x #261

mdclone-oa opened this issue Dec 11, 2023 · 1 comment

mdclone-oa commented Dec 11, 2023

General

  • Operating System: Red Hat Enterprise Linux 8.9 (Ootpa)
  • Python version: 3.12.1
  • Pandas version: 2.1.3
  • Pandarallel version: 1.6.5

Acknowledgement

After upgrading from Python 3.10 to Python 3.12, the runtime of parallel_apply increased almost 3x. I am running in Docker on Red Hat Enterprise Linux 8.9 (Ootpa).

This is the OS information for the container:

NAME="Red Hat Enterprise Linux"
VERSION="8.9 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.9 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.9
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.9"

Python 3.12 packages

annotated-types==0.6.0
astroid==3.0.1
attrs==23.1.0
Cerberus==1.3.5
certifi==2023.11.17
charset-normalizer==3.3.2
contourpy==1.2.0
coverage==7.3.2
cycler==0.12.1
debugpy==1.8.0
dill==0.3.7
distlib==0.3.7
docopt==0.6.2
execnet==2.0.2
fonttools==4.46.0
idna==3.6
iniconfig==2.0.0
isort==5.13.0
Jinja2==3.1.2
joblib==1.3.2
jsonschema==4.20.0
jsonschema-specifications==2023.11.2
kiwisolver==1.4.5
MarkupSafe==2.1.3
matplotlib==3.8.2
mccabe==0.7.0
mlxtend==0.23.0
numpy==1.26.2
packaging==23.2
pandarallel==1.6.5
pandas==2.1.3
pep517==0.13.1
pika==1.3.2
Pillow==10.1.0
pip-api==0.0.30
pipreqs==0.4.13
platformdirs==4.1.0
plette==0.4.4
pluggy==1.3.0
psutil==5.9.6
py-cpuinfo==9.0.0
pydantic==2.5.2
pydantic_core==2.14.5
pylint==3.0.2
pyparsing==3.1.1
pytest==7.4.3
pytest-benchmark==4.0.0
pytest-cov==4.1.0
pytest-html==4.1.1
pytest-metadata==3.0.0
pytest-mock==3.12.0
pytest-order==1.2.0
pytest-ordering==0.6
pytest-timeout==2.2.0
pytest-xdist==3.4.0
python-dateutil==2.8.2
pytz==2023.3.post1
redis==5.0.1
referencing==0.32.0
requests==2.31.0
requirementslib==3.0.0
rpds-py==0.13.2
scikit-learn==1.3.2
scipy==1.11.4
seaborn==0.13.0
setuptools==68.2.2
six==1.16.0
threadpoolctl==3.2.0
tomlkit==0.12.3
typing_extensions==4.9.0
tzdata==2023.3
urllib3==2.1.0
yarg==0.1.9

I can't share all of my code, but this is the relevant part:

results = combined.groupby(by='NewGroup').parallel_apply(
            lambda group: TestClass(data=group.drop(columns=columns, inplace=False)).run())

TestClass - initializes the class with the new data after the drop
columns - a list of columns that we need to drop
run - the method that runs on each group
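
For anyone trying to reproduce the behavior, here is a minimal, self-contained sketch of the same pattern. TestClass, the column names, and the data below are hypothetical stand-ins for my real code, not the original implementation; the point is only to have something runnable that can be timed under both Python versions:

# Hypothetical reproduction of the groupby().parallel_apply() pattern above.
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=False)


class TestClass:
    """Hypothetical stand-in for the class used in the real code."""

    def __init__(self, data: pd.DataFrame):
        self.data = data

    def run(self):
        # Placeholder for the real per-group computation.
        return self.data["value"].sum()


combined = pd.DataFrame({
    "NewGroup": [i % 10 for i in range(100_000)],
    "value": range(100_000),
    "drop_me": "x",
})
columns = ["drop_me"]

results = combined.groupby(by="NewGroup").parallel_apply(
    lambda group: TestClass(data=group.drop(columns=columns, inplace=False)).run()
)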

The servers are the same and the code didn't change, but the runtime still increased almost 3x.

With Python 3.10.11, pandarallel==1.6.5, and pandas==2.0.0, the same DataFrame takes 2.49 min; with Python 3.12.1 it takes 7.22 min.
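
The timings were measured roughly like this (a sketch that reuses the hypothetical TestClass/combined/columns setup from the snippet above):

# Rough timing harness; assumes the hypothetical setup above is already defined.
import time

start = time.perf_counter()
results = combined.groupby(by="NewGroup").parallel_apply(
    lambda group: TestClass(data=group.drop(columns=columns, inplace=False)).run()
)
print(f"parallel_apply: {(time.perf_counter() - start) / 60:.2f} min")

# Timing the same call with plain .apply() under both interpreters would show
# whether the slowdown is in pandarallel's worker/serialization path or in pandas itself.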

nalepae (Owner) commented Jan 23, 2024

Pandaral·lel is looking for a maintainer!
If you are interested, please open a GitHub issue.
