enable_ema causes a runtime error when running train_ppo_llama.sh #245
I have tried to fix it.
I tried the code, but it seems that …
But we must prepare the ema_model for ZeRO-3. It is strange that DeepSpeed would transfer the offloaded model to the GPU.
Can we load ema_model only on rank 0 and keep it always on CPU, so there is no need for ZeRO-3?
This is okay, but you need to modify the code.
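A rough sketch of that modification (the helper name is hypothetical, not the actual OpenRLHF code; note also that under ZeRO-3 the actor's parameters are sharded, so a real version would additionally need to gather the full parameters, e.g. via deepspeed.zero.GatheredParameters, before reading them):

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def update_ema_rank0(actor, ema_model, beta=0.992):
    """Keep the EMA copy only on rank 0 and always on CPU, so it never
    needs to be wrapped or partitioned by ZeRO-3 at all."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank != 0:
        return  # other ranks hold no EMA copy
    for p, ema_p in zip(actor.parameters(), ema_model.parameters()):
        # Pull the actor's value to CPU, where the EMA weights live.
        ema_p.data.mul_(beta).add_(p.data.to("cpu"), alpha=1.0 - beta)
```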
Same issue here with ZeRO-3 training. What should I do to solve it? @hijkzzz
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cpu!
```
Several places in the code put the EMA model on the GPU device:
and the error is raised here:
In the moving_average function, the actor's parameters are fetched to the CPU for the update, but the EMA model's parameters are still on the GPU.
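A minimal sketch of a device-consistent version of the update (the function signature and default decay are assumptions, not the exact OpenRLHF implementation):

```python
import torch

@torch.no_grad()
def moving_average(actor, ema_model, beta=0.992):
    """EMA update: ema = beta * ema + (1 - beta) * actor.

    The actor's value is moved to each EMA parameter's own device first,
    so the update works whether ema_model lives on CPU or GPU and the
    cuda-vs-cpu mismatch above cannot occur.
    """
    for param, ema_param in zip(actor.parameters(), ema_model.parameters()):
        src = param.data.to(ema_param.device)  # align devices before mixing
        ema_param.data.mul_(beta).add_(src, alpha=1.0 - beta)
```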