Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reraise exception back to user #1715

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

reraise exception back to user #1715

wants to merge 4 commits into from

Conversation

gwenzek
Copy link
Contributor

@gwenzek gwenzek commented Oct 13, 2022

@Mortimerp9 was complaining about the "UncompletedJobError" not being very precise.

There can be several cases:

  1. User job crashed because of a system error, in that case we could retry
  2. User job crashed violently and did not left a pickle behind
  3. Job timed out and submitit chose not to resubmitit (either because already tried several times or because not checkpointable)

The idea of this PR is to pickle the full exception and reraise it on the executor side when the user is calling job.result().
This will change '1' into the original exception breaking change
This require pickling an exception which isn't possible by default.
I've included tblib in our code base. The amount of code is relatively low.
Basically we are converting the error and the stack frame into a Python object and dropping some fields that are can't be pickled.
Then during unpickling we do the reverse operation so that we can create a convincing exception.

While adding tests I realize that submitit would NOT write a result pickle file if the pickling of result would fail.
Because the temp file created in

with utils.temporary_save_path(paths.result_pickle) as tmppath: # save somewhere else, and move
would still be present when trying to save the error
with utils.temporary_save_path(paths.result_pickle) as tmppath:
and this would exit without a pickle file.

Mortimer may have saw this issue. So I've rewritten a bit the pickling logic to leave a pickle behind even if we can't pickle the result or the exception. (some 2)

I also realized that catching "Exception" might not be enough and maybe we should also handle "BaseException" since otherwise we might exit without leaving a result.pkl behind us. But I left this for future investigations.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 13, 2022
@gwenzek gwenzek requested a review from jrapin October 13, 2022 10:54
# Conflicts:
#	submitit/core/submission.py
@gwenzek gwenzek force-pushed the traceback branch 2 times, most recently from cdf8874 to 8e58878 Compare October 13, 2022 11:09
@gwenzek
Copy link
Contributor Author

gwenzek commented Oct 13, 2022

well apparently it's not working correctly for python 3.6 ^^'
I don't have time for investigation right now, but it seems to fail on the Executor side, which is worse since that means we won't be able to get any error context.

@Mortimerp9 you can still try it on Python >= 3.8

if outcome == "error":
if isinstance(original_err, str):
# Normally original_err is an exception, unless we failed to pickle it.
trace = original_err
return utils.FailedJobError(
f"Job (task={self.task_id}) failed during processing with trace:\n"
f"----------------------\n{trace}\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just use original_err here? For readability?

elif self.paths.stdout.exists():
log = subprocess.check_output(["tail", "-40", str(self.paths.stdout)], encoding="utf-8")
stderr = subprocess.check_output(["tail", "-40", str(self.paths.stdout)], encoding="utf-8")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you call it stderr but it's reading stdout

message.extend(
[f"No error stream produced. Look at stdout: {self.paths.stdout}", "-" * 40, log]
[f"No error stream produced. Look at stdout: {self.paths.stdout}", "-" * 40, stderr]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

except Exception as error: # TODO: check pickle methods for capturing traceback; pickling and raising
except Exception as pickle_error:
logger.error(f"Could not pickle job result because of {pickle_error}")
save_error(pickle_error, tmp_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't want to raise the exception higher her like in the other cases?

@Mortimerp9
Copy link
Contributor

this looks good to me, but I am no expert of the core of submitit, I'll try it out with stopes

@facebook-github-bot
Copy link

Hi @gwenzek!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@Mortimerp9
Copy link
Contributor

@jrapin could we merge this one or should we just abandon?

@gwenzek
Copy link
Contributor Author

gwenzek commented Jan 25, 2024

Hi Pierre, the reason I didn't push for it getting merged is that the design is a bit hacky.
It's relying on the structure of the internal traceback object in CPython, which can and has changed between major Python releases, and it's not going to work in every context.
A simpler solution would be just to print the stacktrace and pickle the string, so it can be reprinted again on the executor process. Having a full error object has interesting properties, but I'm not sure it's worth the extra complexity.

@Mortimerp9
Copy link
Contributor

Thanks, we ended up not using it until now as the string was often enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants