
[BUG] Local execution job stays running in the UI if Python process dies #162

Open
shazraz opened this issue May 21, 2020 · 1 comment
Labels: bug (Something isn't working)
Comments


shazraz commented May 21, 2020

When a job runs locally but logs to Atlas, it remains stuck in the running state if the underlying Python process dies for some reason (e.g. out-of-memory (OOM) errors). The job can then no longer be managed from the Atlas UI.

Ideally, Atlas would automatically mark the job as failed when the underlying process dies. Failing that, users should be able to "stop" these phantom jobs, which should then appear as failed.
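One way to get the automatic failure described above is to have a small supervising parent process launch the job as a child and report the final status itself. An OOM kill on Linux delivers `SIGKILL`, which the dying process cannot catch, so any cleanup hook inside the job never runs; the parent, however, still observes the exit. The sketch below is a hypothetical illustration, not Atlas's API: `run_supervised` and the `mark_status` callback are assumed names standing in for whatever call would update the job record in Atlas.

```python
import subprocess
import sys
from typing import Callable, List


def run_supervised(job_id: str, cmd: List[str],
                   mark_status: Callable[[str, str], None]) -> int:
    """Run a job as a child process and report its final status.

    Hypothetical helper: `mark_status(job_id, status)` stands in for
    whatever call would update the job's status in the tracking server.
    """
    mark_status(job_id, "running")
    try:
        proc = subprocess.run(cmd)
    except OSError:
        # The job could not even be started (e.g. executable not found).
        mark_status(job_id, "failed")
        raise
    # On POSIX, a negative returncode -N means the child was killed by
    # signal N (e.g. -9 for SIGKILL from the OOM killer). Any non-zero
    # code is treated as a failure here.
    mark_status(job_id, "completed" if proc.returncode == 0 else "failed")
    return proc.returncode


if __name__ == "__main__":
    statuses = {}
    # A child that exits with code 3 is reported as failed.
    run_supervised("job-1", [sys.executable, "-c", "import sys; sys.exit(3)"],
                   statuses.__setitem__)
    print(statuses)
```

The limitation of this approach is that the supervisor must itself survive; if it is killed along with the job, nothing reports the failure.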

This is related to #77 and #137.


shazraz added the bug (Something isn't working) label May 21, 2020

ekhl (Contributor) commented May 21, 2020

On the issue of the wrong status being displayed to the user: one proposal is to replace the job status update mechanism with a heartbeat mechanism, since these jobs can be executed locally (i.e. there is no natural way to supervise them, unlike a job running in the scheduler's cluster). Are there alternatives that can capture these catastrophic failure modes?
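
A minimal sketch of the heartbeat idea, assuming the job process periodically pings the server and the server periodically "reaps" jobs whose last ping is too old. A process that dies abruptly (even via an uncatchable `SIGKILL`) simply stops sending heartbeats, so the server can mark it failed without any cooperation from the dead process. `HeartbeatTracker`, `start_heartbeat_thread`, and the status strings are hypothetical names for illustration, not part of Atlas.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class JobRecord:
    status: str = "running"
    last_heartbeat: float = 0.0


class HeartbeatTracker:
    """Hypothetical server-side view: jobs that stop heartbeating are failed."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be tested
        self.jobs: Dict[str, JobRecord] = {}
        self._lock = threading.Lock()

    def heartbeat(self, job_id: str) -> None:
        # Called whenever a heartbeat arrives from the (possibly local) job.
        with self._lock:
            job = self.jobs.setdefault(job_id, JobRecord())
            if job.status == "running":
                job.last_heartbeat = self.clock()

    def finish(self, job_id: str, status: str = "completed") -> None:
        # Called when the job reports a normal end; it is then never reaped.
        with self._lock:
            self.jobs.setdefault(job_id, JobRecord()).status = status

    def reap(self) -> List[str]:
        # Called periodically on the server; returns the jobs newly failed.
        now = self.clock()
        failed = []
        with self._lock:
            for job_id, job in self.jobs.items():
                if (job.status == "running"
                        and now - job.last_heartbeat > self.timeout_s):
                    job.status = "failed"
                    failed.append(job_id)
        return failed


def start_heartbeat_thread(send: Callable[[], None],
                           interval_s: float) -> threading.Event:
    """Job side: call `send` every `interval_s` seconds until stopped.

    The thread is a daemon, so it dies with the process; that silence is
    exactly what the server-side timeout detects.
    """
    stop = threading.Event()

    def loop() -> None:
        # Event.wait returns False on timeout, True once stop is set.
        while not stop.wait(interval_s):
            send()

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

The main design choice is the timeout: it has to be comfortably larger than the heartbeat interval (several intervals, say) so that a briefly delayed heartbeat from a healthy job is not mistaken for a crash, at the cost of detecting real failures more slowly.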

mohammedri added this to To do in Atlas 🚀 May 22, 2020