Implement checkpoint-resume logic and automatic selection of an available port #728

Open
wants to merge 34 commits into
base: master

LEON-gittech

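This PR implements checkpoint-resume logic and automatic selection of an available port. Port-unavailable errors are still quite common, and they are inconvenient for in-company use: training does not run locally, so every run has to be submitted as a job, and whenever one fails a new job has to be submitted all over again.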
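The port-selection idea described above can be sketched as follows. This is only an illustration, not the code from this PR: the helper names `find_free_port`, `is_port_free`, and `choose_port` are hypothetical, and the sketch assumes the port is needed on the same host that runs the check.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick any free port
        return sock.getsockname()[1]


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if the given port can be bound right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def choose_port(preferred: int, host: str = "127.0.0.1") -> int:
    """Keep the preferred port if it is free, otherwise fall back to a free one."""
    return preferred if is_port_free(preferred, host) else find_free_port(host)
```

The check is inherently racy: another process may grab the port between the check and the moment the training launcher binds it, so a job script may still want to retry on failure.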

@staoxiao
Collaborator

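@LEON-gittech, thank you for the submission.
However, this PR contains a lot of unrelated content, such as debug_args.ipynb and some Chinese comments.
Also, the huggingface Trainer already supports resuming from a checkpoint, and building on it would be more concise. You can simply set resume_from_checkpoint=True in train():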

  import pathlib

  # Resume only if output_dir already contains a checkpoint; otherwise start fresh.
  if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
      trainer.train(resume_from_checkpoint=True)
  else:
      trainer.train()

@LEON-gittech
Author

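The reason I implemented it this way is that in many cases I am not training from scratch: I continue training from a model that someone else has fine-tuned, or from a pretrained model. I cannot modify the original model path, so I need to create a new path to store my checkpoints. Also, as far as I understand, with resume_from_checkpoint=True the checkpoints would be saved locally. A very common scenario inside a company is that a job runs in a temporary environment, so checkpoints certainly cannot be kept on that environment's local disk. They need to go to an HDFS or NAS location, so that the checkpoints survive even if the job is killed.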
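One way to reconcile the two positions is to keep the base model path untouched, point the checkpoint directory at persistent storage, and pass an explicit checkpoint path to `train()`, since `resume_from_checkpoint` accepts a path string as well as a boolean. The sketch below is not the code from this PR: `find_latest_checkpoint` is a hypothetical helper, and it assumes the NAS or HDFS location is mounted as an ordinary filesystem path and that checkpoints follow the `checkpoint-<step>` naming used in the snippet above.

```python
import re
from pathlib import Path
from typing import Optional

# Matches directory names such as "checkpoint-500".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(checkpoint_root: str) -> Optional[str]:
    """Return the checkpoint-<step> directory with the highest step, or None."""
    root = Path(checkpoint_root)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if match and child.is_dir():
            step = int(match.group(1))
            if step > best_step:  # compare numerically, not lexically
                best_step, best_path = step, child
    return str(best_path) if best_path is not None else None


# Possible usage, assuming training_args.output_dir points at the mounted
# persistent storage and `trainer` is an already constructed Trainer:
#
#   checkpoint = find_latest_checkpoint(training_args.output_dir)
#   trainer.train(resume_from_checkpoint=checkpoint)  # None starts fresh
```

Comparing step numbers as integers avoids picking `checkpoint-20` over `checkpoint-100`, which a plain string sort would do.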

2 participants