cannot load reward model from SFT model because of missing keys #137

Open
DZ9 opened this issue Apr 1, 2024 · 8 comments
Labels
bug Something isn't working

Comments

DZ9 commented Apr 1, 2024

I converted a LLaMA model to NeMo, with a model directory like below:
[screenshot of the converted model directory]
When I tried to load it to train a reward model, I got a missing-keys error. I load it with the default config and set load_base_model_only=True; the full loading code is below:

ptl_model = load_from_nemo(
    reward_model_cls,
    cfg.model,
    trainer,
    strict=True,
    load_base_model_only=True,
    restore_path=cfg.pretrained_checkpoint.restore_from_path,
)

I then got the error below. Any advice on how to load a pretrained non-reward model to train it as a reward model in NeMo?

Error executing job with overrides: []
Traceback (most recent call last):
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 206, in load_sharded_object
    loaded_obj = torch.load(load_path)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 998, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 445, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 426, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights/model.rm_head._extra_state/shard_0_1.pt'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/checkpoint/binary/train_package/train_reward_model.py", line 68, in main
    ptl_model = load_from_nemo(
  File "/checkpoint/binary/train_package/nemo_aligner/utils/utils.py", line 96, in load_from_nemo
    model = cls.restore_from(
  File "/checkpoint/binary/train_package/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
    return super().restore_from(
  File "/checkpoint/binary/train_package/nemo/core/classes/modelPT.py", line 450, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/checkpoint/binary/train_package/nemo_aligner/utils/utils.py", line 52, in restore_from
    output = super().restore_from(*args, **kwargs)
  File "/checkpoint/binary/train_package/nemo/collections/nlp/parts/nlp_overrides.py", line 1123, in restore_from
    checkpoint = dist_checkpointing.load(
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 120, in load
    sharded_objects, sharded_state_dict = load_sharded_objects(
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 221, in load_sharded_objects
    return dict_list_map_inplace(load_sharded_object, sharded_objects), sharded_state_dict
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 184, in dict_list_map_inplace
    return f(x)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 218, in load_sharded_object
    raise CheckpointingException(err_msg) from e
megatron.core.dist_checkpointing.core.CheckpointingException: Object shard /mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights/model.rm_head._extra_state/shard_0_1.pt not found

DZ9 added the bug label Apr 1, 2024

DZ9 commented Apr 2, 2024

Can anybody please help with this?

odelalleau (Collaborator) commented

Did you try with strict=False?
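
For reference, a sketch only (reusing the call from the issue description, not a verified fix), that would mean:

import_path = cfg.pretrained_checkpoint.restore_from_path  # same config as above
ptl_model = load_from_nemo(
    reward_model_cls,
    cfg.model,
    trainer,
    strict=False,  # tolerate keys that are missing from the SFT checkpoint
    load_base_model_only=True,
    restore_path=import_path,
)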

gshennvm (Collaborator) commented Apr 4, 2024

Do you know if this is an mcore-based model? And was it SFTed with Aligner?

You can tell whether it is an mcore-based model by looking at the model_weights directory: it should contain common.pt and metadata.json.
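
For example, a minimal check along these lines (the path is illustrative, taken from the traceback above):

import os

weights_dir = "/mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights"

# An mcore-based distributed checkpoint keeps these files at the top level
# of the model_weights directory.
is_mcore = all(
    os.path.isfile(os.path.join(weights_dir, name))
    for name in ("common.pt", "metadata.json")
)
print("mcore-based checkpoint:", is_mcore)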


DZ9 commented Apr 9, 2024

Did you try with strict=False?

Yes, it didn't work either.


DZ9 commented Apr 9, 2024

Do you know if this is an mcore-based model? And was it SFTed with Aligner?

You can tell whether it is an mcore-based model by looking at the model_weights directory: it should contain common.pt and metadata.json.

Yes, it is an mcore-based model:
[screenshot of the model_weights directory contents]


DZ9 commented Apr 9, 2024

I manually deleted all rm_head-related keys during restore, and it now works fine. But I think this is a bug introduced by a change in Megatron.
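
For reference, a minimal sketch of that kind of workaround (assuming a helper applied to the sharded state dict before dist_checkpointing.load runs; the exact hook point depends on your NeMo / NeMo-Aligner version):

def drop_rm_head_entries(sharded_state_dict):
    # Recursively drop entries whose key mentions "rm_head", so the loader
    # does not request shards that the SFT checkpoint never contained.
    if isinstance(sharded_state_dict, dict):
        return {
            key: drop_rm_head_entries(value)
            for key, value in sharded_state_dict.items()
            if "rm_head" not in key
        }
    return sharded_state_dict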

gshennvm (Collaborator) commented Apr 9, 2024

I manually deleted all rm_head-related keys during restore, and it now works fine. But I think this is a bug introduced by a change in Megatron.

Ah okay, that's good to know! Can you elaborate on the change in Megatron? Was your model SFTed in a previous container?

odelalleau (Collaborator) commented

To elaborate, it'd be helpful if you could share the exact steps you used when you said "I converted a llama model to nemo", so that we can reproduce the issue. Which container did you use and which commands did you run?
