XLA related ptxas version error when changing batch size #66716
Comments
Hi @andremfreitas,
Thank you!
Venkat6871 added the stat:awaiting response (Status - Awaiting response from author) and comp:xla (XLA) labels on May 2, 2024
CUDA 12.3. These versions should be compatible with TensorFlow 2.16.
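A quick way to sanity-check the compatibility claim above is to compare the ptxas on PATH with the CUDA version TensorFlow was built against. This is a hypothetical diagnostic sketch, not from the issue; the commands are guarded so they degrade gracefully on machines without CUDA or TensorFlow installed.

```shell
# Show which ptxas is on PATH and its version (if any).
command -v ptxas >/dev/null && ptxas --version || echo "ptxas not on PATH"
# Show the CUDA version TensorFlow was built against (None on CPU-only builds).
python -c "import tensorflow as tf; print(tf.sysconfig.get_build_info().get('cuda_version'))" \
  2>/dev/null || echo "tensorflow not importable"
```

If the ptxas version is older than the CUDA version reported by TensorFlow, XLA can emit version errors like the one in this issue.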
google-ml-butler bot removed the stat:awaiting response (Status - Awaiting response from author) label on May 2, 2024
Turns out it was an issue with a specific node of the cluster ... sorry about that
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
No
Source
source
TensorFlow version
2.16
Custom code
Yes
OS platform and distribution
Linux Ubuntu 22.04.3 LTS
Mobile device
No response
Python version
3.10
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
A100 40GB
Current behavior?
I have a custom training loop that calls functions that are JIT-compiled. I get the error message shown below when using a batch size of 512. However, if I change the batch size to 256 (or 128, for example), I no longer get the error. This is very strange, because the error about the ptxas version (which, as I understand it, is related to the CUDA toolkit version) should have nothing to do with the batch size. So I think the batch size of 512 may be triggering a different error (possibly a memory issue?) and the wrong error is being reported ... I am not sure, but let me know what you think.
Thanks, and sorry for not being able to provide an MWE.
Standalone code to reproduce the issue
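No reproducer was provided. For context, a minimal hypothetical sketch of the kind of setup described above (a custom training loop whose step function is XLA-JIT-compiled; the model, input shapes, and data here are assumptions, not the reporter's code) might look like:

```python
import tensorflow as tf

# Toy model standing in for the reporter's actual network (an assumption).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()

@tf.function(jit_compile=True)  # forces XLA compilation of the training step
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = loss_fn(y, pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

batch_size = 512  # the size reported to trigger the ptxas error on the A100
x = tf.random.normal((batch_size, 32))
y = tf.random.normal((batch_size, 1))
loss = train_step(x, y)
```

In a setup like this, changing `batch_size` changes the shapes XLA compiles for, so each new batch size triggers a fresh compilation; that is why a compile-time error such as a ptxas version mismatch can appear to depend on the batch size.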
Relevant log output