New mle abort subcmd - Clean experiment termination #68

Open
RobertTLange opened this issue May 13, 2021 · 1 comment

Comments

@RobertTLange
Collaborator

RobertTLange commented May 13, 2021

I would like to have a subcommand that terminates all jobs associated with an experiment and removes all generated files and any other trace of it. Otherwise one has to manually use qdel, scancel or gcloud compute instances delete. This could for example be mle abort <experiment_id>, or simply mle abort followed by a user prompt that asks for the experiment and checks that its status is running. A simple procedure could look as follows:

  1. Print a summary of all experiments whose status is running, and get the experiment from the command-line arguments or by prompting the user.
  2. Check that experiment_id is in the DB and that its status is running. Ask again if not.
  3. Get the job name from single_job_args.job_name in the DB.
  4. Delete all jobs whose names start with job_name. How this is done depends on the resource.
  5. Delete all files in experiment_dir.
  6. [Maybe do this as step 1 instead] Set the experiment status to aborted in the DB and push it back to GCP.
  • Main problem: Grid search experiments launch new jobs whenever earlier jobs terminate. How do we circumvent this?
  • Potential solution: Update the database between grid search batches and check whether the status was set to aborted. If not, update the batch counter in the database. If so, stop launching new jobs and break out of the hyperparameter run. This has the added advantage that mle monitor can show the current batch iteration.
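
The batch-level check proposed in the last bullet could look roughly like the sketch below. Everything here is an assumption made for illustration: the DB is modelled as a plain dict, and `run_batches`, `launch_batch`, and the `batch_counter` key are hypothetical names, not existing toolbox code.

```python
from typing import Callable, Dict, List


def run_batches(
    db: Dict[str, dict],
    experiment_id: str,
    batches: List[list],
    launch_batch: Callable[[list], None],
) -> int:
    """Launch grid search batches one by one and stop once the experiment is aborted.

    Returns the number of batches that were actually launched.
    """
    launched = 0
    for batch in batches:
        # In a real setup the DB would be reloaded here (e.g. pulled from GCP),
        # so that a status change made by `mle abort` becomes visible.
        entry = db[experiment_id]
        if entry.get("status") == "aborted":
            break  # stop launching new jobs and leave the hyperparameter run
        launch_batch(batch)
        launched += 1
        # Hypothetical field that a monitor view could display.
        entry["batch_counter"] = launched
    return launched
```

Because the check happens once per batch, jobs that are already running in the current batch still have to be killed separately.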

Also allow the user to choose between terminating via the experiment config .yaml and via the experiment_id.
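
A minimal sketch of the abort procedure, with the experiment selected either by id or by config file, might look as follows. It rests on several assumptions: the DB is a plain dict, the keys `resource` and `config_fname` are hypothetical, and `active_jobs` (job name mapped to the identifier the kill command expects) would have to come from the scheduler. Only `status`, `experiment_dir` and `single_job_args.job_name` are taken from the description above. Commands are handed to a caller-supplied `run_command`, so nothing is executed unless the caller decides to.

```python
import shutil
from typing import Callable, Dict, List, Optional

# Base kill command per resource. The gcloud variant may additionally need --zone.
KILL_COMMANDS = {
    "sge": ["qdel"],
    "slurm": ["scancel"],
    "gcp": ["gcloud", "compute", "instances", "delete", "--quiet"],
}


def find_experiment(
    db: Dict[str, dict],
    experiment_id: Optional[str] = None,
    config_path: Optional[str] = None,
) -> str:
    """Return the id of a running experiment matched by id or by config file."""
    for exp_id, entry in db.items():
        by_id = exp_id == experiment_id
        by_config = config_path is not None and entry.get("config_fname") == config_path
        if by_id or by_config:
            if entry.get("status") != "running":
                raise ValueError(f"{exp_id} is not running")
            return exp_id
    raise KeyError("no matching experiment in DB")


def abort_experiment(
    db: Dict[str, dict],
    exp_id: str,
    active_jobs: Dict[str, str],
    run_command: Callable[[List[str]], None],
) -> List[List[str]]:
    """Mark the experiment aborted, kill its jobs, and delete its files."""
    entry = db[exp_id]
    # Set the status first so that no new batches are launched meanwhile.
    entry["status"] = "aborted"
    prefix = entry["single_job_args"]["job_name"]
    base = KILL_COMMANDS[entry["resource"]]
    commands = [
        base + [job_id]
        for name, job_id in sorted(active_jobs.items())
        if name.startswith(prefix)
    ]
    for cmd in commands:
        run_command(cmd)
    shutil.rmtree(entry["experiment_dir"], ignore_errors=True)
    return commands
```

For a real run, `run_command` could be something like `lambda cmd: subprocess.run(cmd, check=False)`; passing `print` instead gives a dry run of the kill commands, although the directory is still removed.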

Note: Give credit to Tudor's Liftoff package.

@RobertTLange RobertTLange added the core-func Core functionality label May 13, 2021
@RobertTLange RobertTLange self-assigned this May 13, 2021
@RobertTLange RobertTLange changed the title New mle abort subcommand New mle abort subcommand - Clean experiment termination May 13, 2021
@RobertTLange RobertTLange changed the title New mle abort subcommand - Clean experiment termination New mle abort subcmd - Clean experiment termination May 13, 2021
@RobertTLange
Collaborator Author

It would be great to have a keyboard interrupt wrapper that cleans up the protocol/VM instances. Have a look at this thread: https://stackoverflow.com/questions/1187970/how-to-exit-from-python-without-traceback
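
One way such a wrapper could be sketched is as a context manager that runs a cleanup callback on Ctrl-C and then exits without printing a traceback. The name `cleanup_on_interrupt` and the callback are hypothetical, and the exit code 130 is just the common shell convention for termination by SIGINT.

```python
import sys
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def cleanup_on_interrupt(cleanup: Callable[[], None]) -> Iterator[None]:
    """Run `cleanup` if the wrapped block is interrupted, then exit quietly."""
    try:
        yield
    except KeyboardInterrupt:
        # E.g. mark the experiment as aborted and delete VM instances.
        cleanup()
        print("Interrupted, cleanup done.", file=sys.stderr)
        # sys.exit raises SystemExit, which ends the program without a traceback.
        sys.exit(130)
```

Usage would be along the lines of `with cleanup_on_interrupt(my_cleanup): run_experiment()`, where both names stand in for whatever the toolbox actually calls.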
