[Usage] Continue training from pre-trained checkpoint #1496

Open
orrzohar opened this issue May 9, 2024 · 1 comment

orrzohar commented May 9, 2024

Describe the issue

Issue:

Command:

resuming training from pre-trained model (sudden quit)

Log:

Last 10 lines of StdErr:
  File "/train/train_mem.py", line 13, in <module>
    train()
  File "//train/train.py", line 1295, in train
    trainer.train(resume_from_checkpoint=True)
  File "/transformers/trainer.py", line 1850, in train
    state = TrainerState.load_from_json(os.path.join(resume_from_checkpoint, TRAINER_STATE_NAME))
  File "/transformers/trainer_callback.py", line 148, in load_from_json
    with open(json_path, "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: './work_dirs/llava/checkpoint-1000/trainer_state.json'

It seems that the model does not save trainer_state.json during pre-training. Is there a way to include it so that training can be resumed?
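For anyone hitting the same traceback: a quick way to avoid the FileNotFoundError is to check, before resuming, whether any checkpoint directory actually contains trainer_state.json. The helper below is only a sketch. The name `find_resumable_checkpoint` and its `required` parameter are made up for illustration, and it assumes checkpoints are saved as `checkpoint-<step>` directories under the output directory, as in the log above.

```python
import os
import re

# Matches directory names such as "checkpoint-1000".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_resumable_checkpoint(output_dir, required=("trainer_state.json",)):
    """Return the newest checkpoint-N directory that contains every file in
    `required`, or None if no such directory exists (hypothetical helper)."""
    if not os.path.isdir(output_dir):
        return None
    candidates = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            candidates.append((int(match.group(1)), path))
    # Walk from the highest step down and stop at the first complete checkpoint.
    for _, path in sorted(candidates, reverse=True):
        if all(os.path.isfile(os.path.join(path, f)) for f in required):
            return path
    return None


ckpt = find_resumable_checkpoint("./work_dirs/llava")
# Pass an explicit path when one is usable, otherwise start from scratch:
# trainer.train(resume_from_checkpoint=ckpt)
```

If the function returns None, none of the checkpoints can be resumed as they stand, which matches the error in the log.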

@ashmalvayani

Even if you add a trainer_state.json file, training will not resume, because it will then ask for the optimizer files and .pth files, which still won't be saved. I think the best way is to comment out their function and simply keep their "super(LlaVaTrainer, self) ... " line and let the code run. I have tested this: it does not save the mm_projector.bin file at each checkpoint, but it does save the entire weights at each checkpoint.
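To make the suggestion concrete, here is a rough sketch of the pattern with stand-in classes. It assumes the function being referred to is the trainer's `_save_checkpoint` override, which the comment does not confirm. `BaseTrainer` and `ProjectorOnlyTrainer` are hypothetical names used only so the snippet runs on its own, not the real classes.

```python
class BaseTrainer:
    """Stand-in for the parent Trainer class (hypothetical, illustration only)."""

    def __init__(self):
        self.saved = []

    def _save_checkpoint(self, model, trial, metrics=None):
        # The real parent writes the full checkpoint (weights, optimizer state,
        # trainer_state.json); here we only record that it was called.
        self.saved.append("full_checkpoint")


class ProjectorOnlyTrainer(BaseTrainer):
    """Stand-in for a trainer that customises checkpoint saving."""

    def _save_checkpoint(self, model, trial, metrics=None):
        # The custom branch that saved only the projector weights would sit
        # here; it is left out, as the comment above suggests.
        # Only the delegation to the parent implementation remains.
        super(ProjectorOnlyTrainer, self)._save_checkpoint(model, trial, metrics)


trainer = ProjectorOnlyTrainer()
trainer._save_checkpoint(model=None, trial=None)
print(trainer.saved)  # ['full_checkpoint']
```

With only the `super(...)` call left, every checkpoint goes through the parent's full save path, which is why full weights appear at each checkpoint but mm_projector.bin does not.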

You can manually extract the mm_projector weights later. If you don't want to do that, don't worry: at the end of training it automatically saves the trainer_state.json, mm_projector.bin and config.json files once the last step completes.
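Manual extraction could look roughly like the sketch below, which filters a state dict by parameter name. It assumes the projector parameters have "mm_projector" somewhere in their names. `extract_projector_weights` and the sample keys are hypothetical, and plain integers stand in for tensors so the snippet runs without PyTorch.

```python
def extract_projector_weights(state_dict, keywords=("mm_projector",)):
    """Keep only the entries whose parameter name contains one of `keywords`
    (hypothetical helper; works on any name -> tensor mapping)."""
    return {
        name: tensor
        for name, tensor in state_dict.items()
        if any(key in name for key in keywords)
    }


# Toy state dict with made-up parameter names.
full_state = {
    "model.layers.0.weight": 0,
    "model.mm_projector.0.weight": 1,
    "model.mm_projector.0.bias": 2,
}
projector_only = extract_projector_weights(full_state)
print(sorted(projector_only))
# ['model.mm_projector.0.bias', 'model.mm_projector.0.weight']
```

With a real checkpoint you would load its state dict first and then write the filtered dict out, for example with `torch.save`. How the weights are stored (single file, shards, DeepSpeed format) depends on your setup, so that part is left out here.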
