Get OOM errors to stderr and the UI #1696

Open
wants to merge 1 commit into base: master

Commits on Jan 25, 2024

  1. Get OOM errors to stderr and the UI

    There is a race between Metaflow detecting that a pod failed execution
    and the reason for pod failure being set on the pod. As a result,
    at times, the failure reason is not posted to stderr.
    
    This change makes Metaflow try a little harder to get the reason for
    the failures.
    
    For pods that get OOM killed, this change worked just fine.
    
    Without this change, the OOM killed pod would simply die and the user
    would have no idea why. With this change, the error on stderr shows:
    
    Task ran out of memory. Increase the available memory by specifying @resource(memory=...) for the step.
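
The retry idea described in the commit message can be sketched as follows. This is a minimal illustration, not the code in this PR: `wait_for_failure_reason`, `describe_failure`, and the `fetch_reason` callback are hypothetical names, and the attempt count and delay are assumed values. The only details taken from outside the sketch are the `OOMKilled` termination reason that Kubernetes reports for OOM-killed containers and the error message quoted above.

```python
import time
from typing import Callable, Optional

# Message quoted in the commit description above.
OOM_MESSAGE = (
    "Task ran out of memory. Increase the available memory by "
    "specifying @resource(memory=...) for the step."
)


def wait_for_failure_reason(
    fetch_reason: Callable[[], Optional[str]],
    attempts: int = 5,            # assumed value, not from the PR
    delay_seconds: float = 1.0,   # assumed value, not from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll fetch_reason until it returns a non-empty reason or attempts run out.

    fetch_reason is a hypothetical callback that reads the termination
    reason from the failed pod and returns None while it is not yet set.
    """
    for attempt in range(attempts):
        reason = fetch_reason()
        if reason:
            return reason
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return None


def describe_failure(reason: Optional[str]) -> str:
    """Turn a termination reason into a line suitable for stderr."""
    if reason == "OOMKilled":
        return OOM_MESSAGE
    if reason:
        return "Task failed: %s" % reason
    return "Task failed for an unknown reason."
```

Bounding the number of attempts keeps a pod whose reason is never populated from blocking failure reporting indefinitely; in that case the caller falls back to a generic message.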
    shrinandj committed Jan 25, 2024
    8c750f6