
enable_ema causes a runtime error when running train_ppo_llama.sh #245

Open
dshnightmare opened this issue Mar 14, 2024 · 6 comments

@dshnightmare

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cpu!

Several places in the code leave the EMA model on a GPU device:

  if args.enable_ema:
      ema_model = deepcopy(actor)  # the actor is on the GPU by default
  else:
      ema_model = None
  if ema_model:
      ema_model._offload = True
      ema_model = strategy.prepare(ema_model, is_rlhf=True)  # this puts the EMA model on the GPU too
      del ema_model._offload

and the error is here:

  if self.ema_model:
      self.strategy.moving_average(self.actor, self.ema_model, self.ema_beta, "cpu")

In the moving_average function, the actor's parameters are fetched to the CPU for the update, but the EMA model's parameters are still on the GPU.
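The mismatch can be avoided by moving both parameter sets onto the target device before blending. A minimal sketch in plain PyTorch (a hypothetical helper, not the actual OpenRLHF `strategy.moving_average` implementation):

```python
import torch
from copy import deepcopy

def moving_average(model, ema_model, beta=0.992, device="cpu"):
    # Blend each parameter pair on one explicit device, so the update
    # never mixes cuda and cpu tensors.
    with torch.no_grad():
        for p, ema_p in zip(model.parameters(), ema_model.parameters()):
            src = p.data.to(device)
            dst = ema_p.data.to(device)
            # Standard EMA update: ema = beta * ema + (1 - beta) * param
            dst.mul_(beta).add_(src, alpha=1.0 - beta)
            ema_p.data = dst  # keep the EMA parameter resident on `device`

model = torch.nn.Linear(2, 2)
ema_model = deepcopy(model)
moving_average(model, ema_model, beta=0.9)
```

With this shape the EMA weights end up (and stay) on the requested device regardless of where the actor lives.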

@hijkzzz
Collaborator

hijkzzz commented Mar 14, 2024

I have attempted a fix.
Could you please try the latest code on the main branch?

@dshnightmare
Author

I tried the code, but it seems that strategy.prepare still moves the parameters to the GPU. Maybe the EMA model should not go through strategy.prepare.

@hijkzzz
Collaborator

hijkzzz commented Mar 14, 2024

> I tried the code, but it seems that strategy.prepare still moves the parameters to the GPU. Maybe the EMA model should not go through strategy.prepare.

But we must prepare the ema_model for ZeRO-3. It is strange that DeepSpeed transfers the offloaded model to the GPU.

@dshnightmare
Author

Can we load the ema_model only on rank 0 and keep it always on the CPU, so there is no need for ZeRO-3?

@hijkzzz
Collaborator

hijkzzz commented Mar 14, 2024

> Can we load the ema_model only on rank 0 and keep it always on the CPU, so there is no need for ZeRO-3?

This is okay, but you need to modify the code.
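The rank-0-only approach suggested above could look roughly like this. This is a hypothetical sketch (`make_rank0_cpu_ema` is not an OpenRLHF function), and it glosses over one real obstacle noted in the code comment:

```python
import torch
from copy import deepcopy

def make_rank0_cpu_ema(actor, rank):
    # Hypothetical helper (not an OpenRLHF API): only rank 0 keeps a
    # full CPU copy of the actor, so the EMA model is never passed to
    # strategy.prepare and ZeRO-3 never shards or moves its weights.
    # Caveat: under real ZeRO-3 the actor's parameters are sharded
    # across ranks, so building a full copy would first require
    # gathering them (e.g. via deepspeed.zero.GatheredParameters);
    # that step is omitted in this sketch.
    if rank != 0:
        return None
    return deepcopy(actor).to("cpu")

actor = torch.nn.Linear(4, 4)
ema_model = make_rank0_cpu_ema(actor, rank=0)
```

Non-zero ranks would then simply skip the EMA update and checkpointing logic.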

@Ricardokevins

Same issue here with ZeRO-3 training.

So what should I do to solve the issue? @hijkzzz
