
enable_ema causes a runtime error when running train_ppo_llama.sh #245

Open
dshnightmare opened this issue Mar 14, 2024 · 6 comments

@dshnightmare

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cpu!

Several places in the code leave the EMA model on a GPU device:

  if args.enable_ema:
      ema_model = deepcopy(actor)  # the actor is on the GPU by default
  else:
      ema_model = None
  if ema_model:
      ema_model._offload = True
      ema_model = strategy.prepare(ema_model, is_rlhf=True)  # this puts the EMA model on the GPU too
      del ema_model._offload

and the error is here:

  if self.ema_model:
      self.strategy.moving_average(self.actor, self.ema_model, self.ema_beta, "cpu")

In the moving_average function, the actor's parameters are fetched to the CPU for the update, but the EMA model's parameters are still on the GPU.
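The mismatch can be avoided by moving both parameter sets onto the target device before blending. A minimal sketch in plain PyTorch (a hypothetical helper, not the actual OpenRLHF `strategy.moving_average` implementation):

```python
import torch
from copy import deepcopy

def moving_average(model, ema_model, beta=0.992, device="cpu"):
    # Blend each parameter pair on one explicit device, so the update
    # never mixes cuda and cpu tensors.
    with torch.no_grad():
        for p, ema_p in zip(model.parameters(), ema_model.parameters()):
            src = p.data.to(device)
            dst = ema_p.data.to(device)
            # Standard EMA update: ema = beta * ema + (1 - beta) * param
            dst.mul_(beta).add_(src, alpha=1.0 - beta)
            ema_p.data = dst  # keep the EMA parameter resident on `device`

model = torch.nn.Linear(2, 2)
ema_model = deepcopy(model)
moving_average(model, ema_model, beta=0.9)
```

With this shape the EMA weights end up (and stay) on the requested device regardless of where the actor lives.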

@hijkzzz
Collaborator

hijkzzz commented Mar 14, 2024

I have attempted a fix.
Could you please try the latest code on the main branch?

@dshnightmare
Author

I tried the code, but it seems that strategy.prepare still moves the parameters to the GPU. Maybe the EMA model should not go through strategy.prepare.

@hijkzzz
Collaborator

hijkzzz commented Mar 14, 2024

> I tried the code, but it seems that strategy.prepare still moves the parameters to the GPU. Maybe the EMA model should not go through strategy.prepare.

But we must prepare the ema_model for ZeRO-3. It is strange that DeepSpeed transfers the offloaded model to the GPU.

@dshnightmare
Author

Can we load the ema_model only on rank 0 and keep it always on the CPU, so there is no need for ZeRO-3?

@hijkzzz
Collaborator

hijkzzz commented Mar 14, 2024

> Can we load the ema_model only on rank 0 and keep it always on the CPU, so there is no need for ZeRO-3?

This is okay, but you need to modify the code.
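The rank-0-only approach suggested above could look roughly like this. This is a hypothetical sketch (`make_rank0_cpu_ema` is not an OpenRLHF function), and it glosses over one real obstacle noted in the code comment:

```python
import torch
from copy import deepcopy

def make_rank0_cpu_ema(actor, rank):
    # Hypothetical helper (not an OpenRLHF API): only rank 0 keeps a
    # full CPU copy of the actor, so the EMA model is never passed to
    # strategy.prepare and ZeRO-3 never shards or moves its weights.
    # Caveat: under real ZeRO-3 the actor's parameters are sharded
    # across ranks, so building a full copy would first require
    # gathering them (e.g. via deepspeed.zero.GatheredParameters);
    # that step is omitted in this sketch.
    if rank != 0:
        return None
    return deepcopy(actor).to("cpu")

actor = torch.nn.Linear(4, 4)
ema_model = make_rank0_cpu_ema(actor, rank=0)
```

Non-zero ranks would then simply skip the EMA update and checkpointing logic.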

@Ricardokevins

Same issue here with ZeRO-3 training.

So what should I do to solve the issue? @hijkzzz
