Pass SIGTERM to training script to stop training #125

Open
bstriner opened this issue May 19, 2022 · 0 comments · May be fixed by #126


Describe the bug
SIGTERM from StopTrainingJob doesn't appear to be passed to the training subprocess.

To reproduce
Add a SIGTERM handler to a training script, start a training job, then click "Stop". The signal handler will not fire.

Expected behavior
The signal handler in the training script should fire when StopTrainingJob is called.
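
For illustration, the expected behavior amounts to the parent process relaying SIGTERM to the child it launched. The sketch below shows one way that can be done with the standard library. It is not the toolkit's code and not necessarily what #126 does; `run_and_forward_sigterm` is a hypothetical helper name.

```python
import signal
import subprocess


def run_and_forward_sigterm(cmd):
    """Run cmd as a child process, relay SIGTERM to it, return its exit code.

    Hypothetical helper for illustration only. Must be called from the
    main thread, because Python only allows signal handlers to be set there.
    """
    child = subprocess.Popen(cmd)

    def forward(signum, frame):
        # Popen.send_signal does nothing if the child has already been reaped.
        child.send_signal(signum)

    # Install the relay, remembering whatever handler was there before.
    previous = signal.signal(signal.SIGTERM, forward)
    try:
        # wait() resumes after the handler runs, so the child gets a chance
        # to run its own SIGTERM handler and exit on its own terms.
        return child.wait()
    finally:
        signal.signal(signal.SIGTERM, previous)
```

Under these assumptions, something like `run_and_forward_sigterm([sys.executable, "train.py"])` would let a handler in `train.py` run when the parent receives SIGTERM. Note the small window between `Popen` returning and the handler being installed, during which a signal would not be relayed.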


bstriner added a commit to bstriner/sagemaker-training-toolkit that referenced this issue May 19, 2022
feature: Pass SIGTERM to training subprocess
fix: aws#125
@bstriner bstriner linked a pull request May 19, 2022 that will close this issue
6 tasks
bstriner added a commit to bstriner/sagemaker-training-toolkit that referenced this issue May 20, 2022
feature: Pass SIGTERM to training subprocess
fix: aws#125
bstriner added a commit to bstriner/sagemaker-training-toolkit that referenced this issue May 20, 2022
feature: Pass SIGTERM to training subprocess
fix: aws#125