
Add PEP517 compatible build backend #3991

Open

joeyearsley wants to merge 9 commits into master from the PEP517 branch

Conversation

@joeyearsley commented Oct 5, 2023

Checklist before submitting

  • [x] Did you read the contributor guide?
  • [x] Did you update the docs?
  • [ ] Did you write any tests to validate this change?
  • [x] Did you update the CHANGELOG, if this change affects users?

Description

Fixes #3697

The build process does not yet comply with PEP 517, which causes problems for users installing with Poetry and, soon, other PEP 517 build frontends.

This MR adjusts some pre-existing environment variables and the necessary CMake instructions used when building Horovod, making the build system PEP 517 compliant.

An example local install now looks like this (the tarball can be replaced with the PyPI package once this PR is released):

export HOROVOD_WITH_TENSORFLOW=2.8.4
pip install --no-cache-dir --use-pep517 "horovod/dist/horovod-0.28.1.tar.gz"

Since PEP 517 builds every package in an isolated environment, Horovod no longer knows which versions of TF/PyTorch are already installed, so it cannot see which libraries to build against.

We cannot pin a single version in build_requires, because different users run different versions. For example, pinning TF 2.13.1 in build_requires would cause linking issues when the resulting package is used in a venv with TF 2.8.4.

To enable an isolated build while still respecting the user's environment, I've repurposed HOROVOD_WITH_{MXNET|PYTORCH|TENSORFLOW} to specify the versions that the isolated build environment should install. Horovod is then built against the correct library specification in isolation, and the result keeps working once it is moved out of the isolated build environment into its final, non-isolated location.

I've updated key error messages to alert users when they might not be using the correct env var flags.
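For illustration, here is a minimal sketch of how a PEP 517 backend hook could translate the HOROVOD_WITH_* variables described above into dynamic build requirements. The module layout, names, and framework-to-package mapping below are illustrative assumptions, not necessarily the code in this PR:

# Hypothetical build-backend hook (sketch only, not Horovod's actual implementation).
# It maps HOROVOD_WITH_* values onto requirements for the isolated build environment.
import os
from setuptools import build_meta as _orig

_FRAMEWORK_VARS = {
    'HOROVOD_WITH_TENSORFLOW': 'tensorflow',
    'HOROVOD_WITH_PYTORCH': 'torch',
    'HOROVOD_WITH_MXNET': 'mxnet',
}

def get_requires_for_build_wheel(config_settings=None):
    requires = list(_orig.get_requires_for_build_wheel(config_settings))
    for var, package in _FRAMEWORK_VARS.items():
        value = os.environ.get(var, '')
        # A version string (e.g. "2.8.4") pins the framework inside the isolated
        # build environment, so the extension is compiled and linked against the
        # same version the user has installed in their real environment.
        if value and value != '1':
            requires.append(f'{package}=={value}')
    return requires

With a hook along these lines, HOROVOD_WITH_TENSORFLOW=2.8.4 would make the build frontend install tensorflow==2.8.4 into the isolated environment before CMake runs, matching the version in the target venv.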

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

Joe Yearsley added 4 commits October 5, 2023 22:04
Signed-off-by: Joe Yearsley <[email protected]>
Signed-off-by: Joe Yearsley <[email protected]>
Update version

Signed-off-by: Joe Yearsley <[email protected]>
Joe Yearsley added 2 commits October 5, 2023 22:32
Signed-off-by: Joe Yearsley <[email protected]>
Signed-off-by: Joe Yearsley <[email protected]>
@joeyearsley (Author) commented Oct 11, 2023

Note the outstanding error seems to be a CI problem:

java.io.IOException: Failed to run image 'tensorflowppc64le/tensorflow-ppc64le:osuosl-ubi7-horovod-opence1.4.1-py3.9-ppc64le'. Error: docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.

@joeyearsley (Author) commented:
Is horovod maintained anymore?

@leweex95 commented Nov 7, 2023

It's sad to see no reaction from horovod developers to this crucial PR. I just tried to install horovod via poetry directly from this PR:

export HOROVOD_WITH_PYTORCH=2.1.0
poetry add -vvv git+https://github.com/horovod/horovod.git@refs/pull/3991/merge

but it fails with the following error:

  ChefBuildError

  Backend subprocess exited when trying to invoke build_wheel
  
  running bdist_wheel
  running build
  running build_py
  running build_ext
  Running CMake in build/temp.linux-x86_64-cpython-38/RelWithDebInfo:
  cmake /home/myproj/.venv/src/horovod -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELWITHDEBINFO=/home/myproj/.venv/src/horovod/build/lib.linux-x86_64-cpython-38 -DPYTHON_EXECUTABLE:FILEPATH=/tmp/tmpgfgrfzib/.venv/bin/python
  cmake --build . --config RelWithDebInfo -- -j8 VERBOSE=1
  -- Could not find CCache. Consider installing CCache to speed up compilation.
  -- Build architecture flags: -mf16c -mavx -mfma
  -- Using command /tmp/tmpgfgrfzib/.venv/bin/python
  -- Could NOT find MPI_CXX (missing: MPI_CXX_LIB_NAMES MPI_CXX_HEADER_DIR MPI_CXX_WORKS) 
  -- Could NOT find MPI (missing: MPI_CXX_FOUND) 
  -- Could not find nvcc, please set CUDAToolkit_ROOT.
  -- Could NOT find NVTX (missing: NVTX_INCLUDE_DIR) 
  CMake Deprecation Warning at third_party/gloo/CMakeLists.txt:1 (cmake_minimum_required):
    Compatibility with CMake < 3.5 will be removed from a future version of
    CMake.
  
    Update the VERSION argument <min> value or use a ...<max> suffix to tell
    CMake that the project does not need compatibility with older versions.
  
  
  -- Gloo build as STATIC library
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
  ModuleNotFoundError: No module named 'tensorflow'
  -- Could NOT find Tensorflow (missing: Tensorflow_LIBRARIES) (Required is at least version "1.15.0")
  Traceback (most recent call last):
    File "/tmp/tmpgfgrfzib/.venv/lib/python3.8/site-packages/torch/__init__.py", line 174, in _load_global_deps
      ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
    File "/usr/lib/python3.8/ctypes/__init__.py", line 373, in __init__
      self._handle = _dlopen(self._name, mode)
  OSError: libcufft.so.11: cannot open shared object file: No such file or directory
  
  During handling of the above exception, another exception occurred:
  
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/tmpgfgrfzib/.venv/lib/python3.8/site-packages/torch/__init__.py", line 234, in <module>
      _load_global_deps()
    File "/tmp/tmpgfgrfzib/.venv/lib/python3.8/site-packages/torch/__init__.py", line 195, in _load_global_deps
      _preload_cuda_deps(lib_folder, lib_name)
    File "/tmp/tmpgfgrfzib/.venv/lib/python3.8/site-packages/torch/__init__.py", line 160, in _preload_cuda_deps
      raise ValueError(f"{lib_name} not found in the system path {sys.path}")
  ValueError: libcublas.so.*[0-9] not found in the system path ['', '/usr/lib/python38.zip', '/usr/lib/python3.8', '/usr/lib/python3.8/lib-dynload', '/tmp/tmpgfgrfzib/.venv/lib/python3.8/site-packages']
  CMake Error at /tmp/tmpgfgrfzib/.venv/lib/python3.8/site-packages/cmake/data/share/cmake-3.27/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
    Could NOT find Pytorch: (Required is at least version "1.5.0") (found )
  Call Stack (most recent call first):
    /tmp/tmpgfgrfzib/.venv/lib/python3.8/site-packages/cmake/data/share/cmake-3.27/Modules/FindPackageHandleStandardArgs.cmake:598 (_FPHSA_FAILURE_MESSAGE)
    cmake/Modules/FindPytorch.cmake:20 (find_package_handle_standard_args)
    horovod/torch/CMakeLists.txt:12 (find_package)

...

    File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['cmake', '/home/myproj/.venv/src/horovod', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELWITHDEBINFO=/home/myproj/.venv/src/horovod/build/lib.linux-x86_64-cpython-38', '-DPYTHON_EXECUTABLE:FILEPATH=/tmp/tmpgfgrfzib/.venv/bin/python']' returned non-zero exit status 1.

@EnricoMi (Collaborator) left a comment

This is awesome! Few comments.

@@ -1,3 +1,3 @@
 from horovod.runner import run
 
-__version__ = '0.28.1'
+__version__ = '0.29.0'
@EnricoMi (Collaborator):

Don't bump the version; this is done during the next release.

@@ -236,7 +236,7 @@ RUN if [[ ${MPI_KIND} == "ONECCL" ]]; then \
fi; \
cd /horovod && \
python setup.py sdist && \
bash -c "${HOROVOD_BUILD_FLAGS} HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 pip install --no-cache-dir -v $(ls /horovod/dist/horovod-*.tar.gz)[spark,ray]"
@EnricoMi (Collaborator) commented Dec 30, 2023:

This is not a breaking change, as Horovod can still be installed via the old HOROVOD_WITH_*=1 vars using --no-build-isolation, right?

HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 pip install --no-build-isolation ...

Can we somehow imply the --no-build-isolation when those HOROVOD_WITH_* vars are 1? Otherwise this may be considered a breaking change...
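For illustration only (a hypothetical sketch, not code from this PR): build isolation is chosen by the frontend rather than by the package, so the backend cannot simply switch it off, but a guard like the one below could detect the legacy HOROVOD_WITH_*=1 usage inside an isolated build and fail early with a clear hint:

# Hypothetical guard for the legacy HOROVOD_WITH_*=1 semantics (sketch only).
# "1" means "build against the framework that is already importable here", which
# only works when the frontend was run with --no-build-isolation.
import importlib.util
import os

def check_legacy_flag(var, module_name):
    if os.environ.get(var) == '1' and importlib.util.find_spec(module_name) is None:
        raise RuntimeError(
            f'{var}=1 requires {module_name} to be importable during the build. '
            f'Either install with --no-build-isolation, or set {var} to a version '
            f'string (e.g. {var}=2.8.4) so it is installed into the build environment.'
        )

check_legacy_flag('HOROVOD_WITH_TENSORFLOW', 'tensorflow')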

@@ -221,11 +234,13 @@ def get_average_backwards_compatibility_fun(reduce_ops):
    def impl(op, average):
        if op is not None:
            if average is not None:
                raise ValueError('The op parameter supersedes average. Please provide only one of them.')
@EnricoMi (Collaborator):

Could we move these reformatting changes into a separate PR?

@_cache
def ccl_built(verbose=False):
    for ext_base_name in EXTENSIONS:
-        built_fn = lambda ext: ext.ccl_built()
+        def built_fn(ext): return ext.ccl_built()
@EnricoMi (Collaborator):

This is no functional change, just syntax, right?

@@ -26,12 +26,14 @@
import textwrap
@EnricoMi (Collaborator):

Shouldn't setup.py be removed, or is this for backward-compatibility?

github-actions bot commented Jan 10, 2024

Unit Test Results

  223 files   -    501    223 suites   - 501   1h 50m 57s ⏱️ - 6h 15m 10s
  805 tests  -     82    605 ✅  -   163    200 💤 +   81  0 ❌ ±0 
4 996 runs   - 11 213  3 329 ✅  - 8 024  1 667 💤  - 3 189  0 ❌ ±0 

Results for commit 7c6a96a. ± Comparison against base commit 9f88e1d.

This pull request removes 82 tests.
test.integration.test_elastic_spark_tensorflow.ElasticSparkTensorflowTests ‑ test_auto_scale_down_by_discovery
test.integration.test_elastic_spark_tensorflow.ElasticSparkTensorflowTests ‑ test_auto_scale_down_by_exception
test.integration.test_elastic_spark_tensorflow.ElasticSparkTensorflowTests ‑ test_auto_scale_no_spark_black_list
test.integration.test_elastic_spark_tensorflow.ElasticSparkTensorflowTests ‑ test_auto_scale_spark_blacklist_no_executor_reuse
test.integration.test_elastic_spark_tensorflow.ElasticSparkTensorflowTests ‑ test_auto_scale_spark_blacklist_no_executor_reuse_in_app
test.integration.test_elastic_spark_tensorflow.ElasticSparkTensorflowTests ‑ test_auto_scale_spark_blacklist_no_executor_reuse_same_task
test.integration.test_elastic_spark_tensorflow.ElasticSparkTensorflowTests ‑ test_auto_scale_spark_blacklist_no_node_reuse
test.integration.test_elastic_spark_tensorflow.ElasticSparkTensorflowTests ‑ test_auto_scale_spark_blacklist_no_node_reuse_in_app
test.integration.test_elastic_spark_tensorflow.ElasticSparkTensorflowTests ‑ test_auto_scale_up
test.integration.test_elastic_spark_tensorflow.ElasticSparkTensorflowTests ‑ test_fault_tolerance_all_hosts_lost
…
This pull request skips 95 tests.
test.integration.test_interactiverun.InteractiveRunTests ‑ test_happy_run_elastic
test.parallel.test_keras.KerasTests ‑ test_elastic_state
test.parallel.test_keras.KerasTests ‑ test_from_config
test.parallel.test_keras.KerasTests ‑ test_load_model
test.parallel.test_keras.KerasTests ‑ test_load_model_broadcast
test.parallel.test_keras.KerasTests ‑ test_load_model_custom_objects
test.parallel.test_keras.KerasTests ‑ test_load_model_custom_optimizers
test.parallel.test_keras.KerasTests ‑ test_sparse_as_dense
test.parallel.test_mxnet1.MX1Tests ‑ test_gluon_trainer
test.parallel.test_mxnet1.MX1Tests ‑ test_gpu_required
…

♻️ This comment has been updated with latest results.

github-actions bot commented Jan 10, 2024

Unit Test Results (with flaky tests)

  223 files   -    665    223 suites   - 665   1h 50m 57s ⏱️ - 7h 7m 44s
  805 tests  -     82    605 ✅  -    163    200 💤 +   81  0 ❌ ±0 
4 996 runs   - 15 243  3 329 ✅  - 10 460  1 667 💤  - 4 783  0 ❌ ±0 

Results for commit 7c6a96a. ± Comparison against base commit 9f88e1d.


♻️ This comment has been updated with latest results.

@EnricoMi force-pushed the PEP517 branch 2 times, most recently from d631681 to 4b2c05a on January 10, 2024 20:01
stale bot commented Mar 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the wontfix label on Mar 17, 2024.

Successfully merging this pull request may close these issues.

Horovod fails to install via Poetry
4 participants