
XLA related ptxas version error when changing batch size #66716

Closed
andremfreitas opened this issue Apr 30, 2024 · 4 comments

@andremfreitas

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

source

TensorFlow version

2.16

Custom code

Yes

OS platform and distribution

Linux Ubuntu 22.04.3 LTS

Mobile device

No response

Python version

3.10

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

A100 40GB

Current behavior?

I have a custom training loop that calls functions that are jit-compiled. With a batch size of 512 I get the error message shown below. However, if I change the batch size to 256 (or 128, for example), the error no longer appears. This is strange, because the ptxas version error (which, as I understand it, relates to the CUDA toolkit version) should have nothing to do with the batch size. My suspicion is that the batch size of 512 triggers a different failure (possibly a memory issue?) and the wrong error is being reported. I am not sure, but let me know what you think.

Thanks, and sorry for not being able to provide an MWE.

Standalone code to reproduce the issue

Cannot build an MWE, unfortunately.
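
For context, a setup of the shape described above would look roughly like the following. This is a minimal sketch only, not the author's code: the model, optimizer, input shapes, and loss are placeholders, and whether it reproduces the error will depend on the environment. It only illustrates the pattern of an XLA-compiled training step whose failure appears to depend on batch size.

```python
import tensorflow as tf

# Placeholder model and optimizer; the real script uses a custom CNN.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()

# XLA-compiled training step, as described in the report.
@tf.function(jit_compile=True)
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x, training=True) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Reported: the error appears with batch_size = 512 but not 256 or 128.
batch_size = 512
x = tf.random.normal((batch_size, 32))
y = tf.random.normal((batch_size, 1))
loss = train_step(x, y)
```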

Relevant log output

2024-04-30 16:37:32.740441: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:580 : INTERNAL: XLA requires ptxas version 11.8 or higher
2024-04-30 16:37:32.740516: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: INTERNAL: XLA requires ptxas version 11.8 or higher
	 [[{{node PartitionedCall}}]]
Traceback (most recent call last):
  File "/home/ids/afreitas/april/cnn_test/train_traj.py", line 281, in <module>
    loss = training_loop(ic, gt, msteps_sched[j])    
  File "/home/ids/afreitas/my_tf/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/ids/afreitas/my_tf/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Detected at node PartitionedCall defined at (most recent call last):
  File "/home/ids/afreitas/april/cnn_test/train_traj.py", line 281, in <module>

  File "/home/ids/afreitas/april/cnn_test/train_traj.py", line 191, in training_loop

  File "/home/ids/afreitas/april/cnn_test/train_traj.py", line 192, in training_loop

XLA requires ptxas version 11.8 or higher
	 [[{{node PartitionedCall}}]] [Op:__inference_training_loop_26507]
@Venkat6871

Hi @andremfreitas,

  • Ensure that CUDA Toolkit 11.8 or higher is installed on your system. You can download the latest version from the NVIDIA website and follow the installation instructions for your operating system.
  • Verify that your TensorFlow version is compatible with the CUDA Toolkit version you are using; some TensorFlow releases require specific CUDA Toolkit versions for full compatibility (one way to check is sketched after this comment).
  • For reference, please see the TensorFlow documentation on tested build configurations.

Thank you!
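
One quick way to compare the CUDA/cuDNN versions TensorFlow was built against with the ptxas visible on PATH (typically what XLA picks up) is sketched below. This is an illustrative snippet, not an official diagnostic; the exact output format varies by install.

```python
import shutil
import subprocess

import tensorflow as tf

# CUDA/cuDNN versions this TensorFlow build was compiled against.
build_info = tf.sysconfig.get_build_info()
print("built with CUDA:", build_info.get("cuda_version"))
print("built with cuDNN:", build_info.get("cudnn_version"))

# ptxas found on PATH (shipped with the CUDA toolkit).
ptxas = shutil.which("ptxas")
print("ptxas on PATH:", ptxas)
if ptxas:
    print(subprocess.run([ptxas, "--version"],
                         capture_output=True, text=True).stdout)
```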

@Venkat6871 Venkat6871 added stat:awaiting response Status - Awaiting response from author comp:xla XLA labels May 2, 2024
@andremfreitas (Author)

CUDA 12.3
cuDNN 8.9

These versions should be compatible with TensorFlow 2.16.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label May 2, 2024
@andremfreitas (Author)

It turns out it was an issue with a specific node of the cluster. Sorry about that.
