cannot load reward model from SFT model because of missing keys #137

Open
DZ9 opened this issue Apr 1, 2024 · 8 comments
Labels
bug Something isn't working

Comments

DZ9 commented Apr 1, 2024

I converted a LLaMA model to NeMo, with a model directory like below:
[screenshot of the converted model directory]
When I tried to load it to train a reward model, I got a missing-keys error. I load it with the default config and set load_base_model_only=True; the full loading code is below:

ptl_model = load_from_nemo(
    reward_model_cls,
    cfg.model,
    trainer,
    strict=True,
    load_base_model_only=True,
    restore_path=cfg.pretrained_checkpoint.restore_from_path,
)

I then got the error below. Any advice on how to load a pretrained non-reward model to train it as a reward model in NeMo?

Error executing job with overrides: []
Traceback (most recent call last):
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 206, in load_sharded_object
    loaded_obj = torch.load(load_path)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 998, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 445, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 426, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights/model.rm_head._extra_state/shard_0_1.pt'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/checkpoint/binary/train_package/train_reward_model.py", line 68, in main
    ptl_model = load_from_nemo(
  File "/checkpoint/binary/train_package/nemo_aligner/utils/utils.py", line 96, in load_from_nemo
    model = cls.restore_from(
  File "/checkpoint/binary/train_package/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
    return super().restore_from(
  File "/checkpoint/binary/train_package/nemo/core/classes/modelPT.py", line 450, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/checkpoint/binary/train_package/nemo_aligner/utils/utils.py", line 52, in restore_from
    output = super().restore_from(*args, **kwargs)
  File "/checkpoint/binary/train_package/nemo/collections/nlp/parts/nlp_overrides.py", line 1123, in restore_from
    checkpoint = dist_checkpointing.load(
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 120, in load
    sharded_objects, sharded_state_dict = load_sharded_objects(
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 221, in load_sharded_objects
    return dict_list_map_inplace(load_sharded_object, sharded_objects), sharded_state_dict
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 184, in dict_list_map_inplace
    return f(x)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 218, in load_sharded_object
    raise CheckpointingException(err_msg) from e
megatron.core.dist_checkpointing.core.CheckpointingException: Object shard /mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights/model.rm_head._extra_state/shard_0_1.pt not found

DZ9 added the bug label Apr 1, 2024

DZ9 commented Apr 2, 2024

Can anybody please help with this?

odelalleau (Collaborator) commented

Did you try with strict=False?
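
For reference, a sketch only (reusing the call from the issue description, not a verified fix), that would mean:

import_path = cfg.pretrained_checkpoint.restore_from_path  # same config as above
ptl_model = load_from_nemo(
    reward_model_cls,
    cfg.model,
    trainer,
    strict=False,  # tolerate keys that are missing from the SFT checkpoint
    load_base_model_only=True,
    restore_path=import_path,
)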

gshennvm (Collaborator) commented Apr 4, 2024

Do you know if this is an mcore-based model? And was it SFTed with Aligner?

You can tell whether it is an mcore-based model by looking at the model_weights directory: it should contain common.pt and metadata.json.
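
For example, a minimal check along these lines (the path is illustrative, taken from the traceback above):

import os

weights_dir = "/mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights"

# An mcore-based distributed checkpoint keeps these files at the top level
# of the model_weights directory.
is_mcore = all(
    os.path.isfile(os.path.join(weights_dir, name))
    for name in ("common.pt", "metadata.json")
)
print("mcore-based checkpoint:", is_mcore)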


DZ9 commented Apr 9, 2024

Did you try with strict=False?

Yes, it didn't work either.


DZ9 commented Apr 9, 2024

Do you know if this is an mcore-based model? And was it SFTed with Aligner?

You can tell whether it is an mcore-based model by looking at the model_weights directory: it should contain common.pt and metadata.json.

Yes, it is an mcore-based model:
[screenshot of the model_weights directory contents]


DZ9 commented Apr 9, 2024

I manually deleted all rm_head-related keys during restore, and it now works fine. But I think this is a bug introduced by a change in Megatron.
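
For reference, a minimal sketch of that kind of workaround (assuming a helper applied to the sharded state dict before dist_checkpointing.load runs; the exact hook point depends on your NeMo / NeMo-Aligner version):

def drop_rm_head_entries(sharded_state_dict):
    # Recursively drop entries whose key mentions "rm_head", so the loader
    # does not request shards that the SFT checkpoint never contained.
    if isinstance(sharded_state_dict, dict):
        return {
            key: drop_rm_head_entries(value)
            for key, value in sharded_state_dict.items()
            if "rm_head" not in key
        }
    return sharded_state_dict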

gshennvm (Collaborator) commented Apr 9, 2024

I manually deleted all rm_head-related keys during restore, and it now works fine. But I think this is a bug introduced by a change in Megatron.

Ah okay, that's good to know! Can you elaborate on the change in Megatron? Was your model SFTed in a previous container?

odelalleau (Collaborator) commented

To elaborate, it'd be helpful if you could share the exact steps you used when you said "I converted a llama model to nemo", so that we can reproduce the issue. Which container did you use and which commands did you run?
