Metaflow crashes on AWS Batch if folder called metaflow is present in the working directory #1672

Open
shchur opened this issue Jan 9, 2024 · 5 comments

Comments

@shchur

shchur commented Jan 9, 2024

I have created the stack using the Metaflow CloudFormation template.

I can run flows locally, but flows fail if I add the @batch decorator.

For example, this flow

Code
from metaflow import FlowSpec, step, batch


class HelloFlow(FlowSpec):
    @step
    def start(self):
        print("HelloFlow is starting.")
        self.next(self.hello)

    @batch
    @step
    def hello(self):
        print("Metaflow says: Hi!")
        self.next(self.end)

    @step
    def end(self):
        print("HelloFlow is all done.")


if __name__ == "__main__":
    HelloFlow()

fails with the following error message

Metaflow 2.10.3 executing HelloFlow for user:shchuro
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
2024-01-09 22:26:54.668 Workflow starting (run-id 1704839212206922):
2024-01-09 22:26:54.832 [1704839212206922/start/1 (pid 31890)] Task is starting.
2024-01-09 22:26:57.629 [1704839212206922/start/1 (pid 31890)] HelloFlow is starting.
2024-01-09 22:26:59.947 [1704839212206922/start/1 (pid 31890)] Task finished successfully.
2024-01-09 22:27:01.262 [1704839212206922/hello/2 (pid 32009)] Task is starting.
2024-01-09 22:27:06.207 [1704839212206922/hello/2 (pid 32009)] [bf7f4021-757d-4d11-bf01-553f2637aff1] Task is starting (status SUBMITTED)...
2024-01-09 22:27:08.452 [1704839212206922/hello/2 (pid 32009)] [bf7f4021-757d-4d11-bf01-553f2637aff1] Task is starting (status RUNNABLE)...
2024-01-09 22:27:10.692 [1704839212206922/hello/2 (pid 32009)] [bf7f4021-757d-4d11-bf01-553f2637aff1] Task is starting (status STARTING)...
2024-01-09 22:27:25.274 [1704839212206922/hello/2 (pid 32009)] [bf7f4021-757d-4d11-bf01-553f2637aff1] Task is starting (status FAILED)...
2024-01-09 22:27:27.758 [1704839212206922/hello/2 (pid 32009)] Data store error:
2024-01-09 22:27:27.759 [1704839212206922/hello/2 (pid 32009)] No completed attempts of the task was found for task 'HelloFlow/1704839212206922/hello/2'
2024-01-09 22:27:28.156 [1704839212206922/hello/2 (pid 32009)]
2024-01-09 22:27:28.554 [1704839212206922/hello/2 (pid 32009)] Task failed.
2024-01-09 22:27:28.888 Workflow failed.
2024-01-09 22:27:28.888 Terminating 0 active tasks...
2024-01-09 22:27:28.888 Flushing logs...
    Step failure:
    Step hello (task-id 2) failed.

The S3 folder for the failing step (HelloFlow/1704839212206922/hello/2) contains the following files:

  • 0.attempt.json
  • 0.runtime
  • 0.runtime_stderr.log
  • 0.runtime_stdout.log

I would really appreciate any pointers on how to debug this issue.

@madhur-ob
Collaborator

Hey @shchur, I wasn't able to reproduce this...attaching a screenshot of a successful run with @batch

Screenshot 2024-01-10 at 7 55 49 AM

Could you perhaps:

  • post the contents of 0.runtime_stderr.log and 0.runtime_stdout.log
  • try a newer Metaflow version (2.10.8), since you are on 2.10.3

@shchur
Author

shchur commented Jan 10, 2024

Thank you for a quick response! I tried using 2.10.8 and encountered the same error. Here are the contents of s3://****-metaflows3bucket-vqcend9aqn9z/HelloFlow/1704868945391404/hello/2/:

  • 0.runtime_stderr.log:
[MFLOG|0|2024-01-10T06:46:03.311728Z|runtime|91c794bc-1325-4cc5-98d7-3939daef119f]    Data store error:
[MFLOG|0|2024-01-10T06:46:03.311985Z|runtime|735f3f38-9a60-4978-aefb-6f69aff0d75a]    No completed attempts of the task was found for task 'HelloFlow/1704868945391404/hello/2'
[MFLOG|0|2024-01-10T06:46:03.691040Z|runtime|7f210482-078e-4217-abc9-92b1901c9e3c]
[MFLOG|0|2024-01-10T06:46:04.086122Z|runtime|e0e3f263-4ef9-404f-8196-134901d480d3]Task failed.
  • 0.runtime_stdout.log:
[MFLOG|0|2024-01-10T06:42:38.449017Z|runtime|dfcf4833-1cab-43fd-996e-4edefcc22977][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status RUNNABLE)...
[MFLOG|0|2024-01-10T06:43:08.649344Z|runtime|e2a6331b-df2c-42ec-b401-3c0463190c3a][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status RUNNABLE)...
[MFLOG|0|2024-01-10T06:43:38.852722Z|runtime|9ebac194-f505-491e-8a17-e45a25339148][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status RUNNABLE)...
[MFLOG|0|2024-01-10T06:44:09.018890Z|runtime|085c5b85-e151-4248-92b5-49cff2390b1f][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status RUNNABLE)...
[MFLOG|0|2024-01-10T06:44:39.160499Z|runtime|4d066689-d3e0-45a1-a9ee-6aaad1fe506d][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status RUNNABLE)...
[MFLOG|0|2024-01-10T06:45:09.384003Z|runtime|de486574-58a1-4c38-b52f-326f3237b4bc][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status RUNNABLE)...
[MFLOG|0|2024-01-10T06:45:39.528610Z|runtime|ee0bd5f6-3739-4242-bab8-4eedccdb964e][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status RUNNABLE)...
[MFLOG|0|2024-01-10T06:45:45.104157Z|runtime|f36f4539-ceaa-490d-a4a0-afb0ec2ec232][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status STARTING)...
[MFLOG|0|2024-01-10T06:46:00.746162Z|runtime|a324752b-a551-46e4-9db0-f347123d424c][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status FAILED)...
  • 0.runtime
{"return_code": 1, "killed": false, "success": false}
  • 0.attempt.json
{"time": 1704868953.193214}

I suspect that the problem lies in the AWS configuration, but I'm not sure how to get to its root cause. I used the CloudFormation template without any modifications, and stack creation finished with CREATE_COMPLETE status.

Are there any additional log statements that I could add to the Metaflow code to get a more informative error message / understand which exact operation is failing?
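
One thing the runtime log above already shows is how long the job spent in each AWS Batch state. Below is a minimal sketch, assuming the `[MFLOG|version|timestamp|source|id]message` layout seen in those lines (inferred from this log, not taken from a documented spec). `status_timeline` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Layout inferred from the log lines in this thread (an assumption, not a spec).
MFLOG_RE = re.compile(
    r"^\[MFLOG\|(?P<version>\d+)\|(?P<ts>[^|]+)\|(?P<source>[^|]+)\|(?P<id>[^\]]+)\](?P<msg>.*)$"
)
STATUS_RE = re.compile(r"\(status (?P<status>[A-Z]+)\)")


def status_timeline(lines):
    """Return [(timestamp, status)], keeping the first line of each status run."""
    timeline = []
    for line in lines:
        m = MFLOG_RE.match(line.strip())
        if not m:
            continue  # not an MFLOG line
        s = STATUS_RE.search(m.group("msg"))
        if not s:
            continue  # MFLOG line without a "(status ...)" marker
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%fZ")
        status = s.group("status")
        if not timeline or timeline[-1][1] != status:
            timeline.append((ts, status))
    return timeline


# Three lines copied from 0.runtime_stdout.log above.
SAMPLE = [
    "[MFLOG|0|2024-01-10T06:42:38.449017Z|runtime|dfcf4833-1cab-43fd-996e-4edefcc22977][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status RUNNABLE)...",
    "[MFLOG|0|2024-01-10T06:45:45.104157Z|runtime|f36f4539-ceaa-490d-a4a0-afb0ec2ec232][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status STARTING)...",
    "[MFLOG|0|2024-01-10T06:46:00.746162Z|runtime|a324752b-a551-46e4-9db0-f347123d424c][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status FAILED)...",
]

if __name__ == "__main__":
    events = status_timeline(SAMPLE)
    for (t0, s0), (t1, s1) in zip(events, events[1:]):
        print(f"{s0} -> {s1}: {(t1 - t0).total_seconds():.1f}s")
```

On these sample lines the job waits in RUNNABLE for about three minutes and fails roughly 15 seconds after reaching STARTING. That pattern suggests the container launched and then exited early, so the container's own log is the more likely place to find the real error.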

@shchur
Author

shchur commented Jan 10, 2024

I found the source of the problem: my working directory included a folder called metaflow, which crashed the metaflow command executed during env setup.

mkdir: cannot create directory ‘metaflow’: File exists	

It might be helpful to check for presence of this folder before submitting the job and raise an informative error message to the user.

As for debugging jobs on AWS Batch, I was able to find the detailed log with error message in the Amazon Elastic Container Service console under Clusters.
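
The same container log can usually be pulled programmatically as well. Below is a hedged sketch using boto3's `describe_jobs` (Batch) and `get_log_events` (CloudWatch Logs). It assumes the job writes to CloudWatch Logs under the default `/aws/batch/job` log group, which a job definition may override. `describe_batch_failure` is a hypothetical helper, and the clients are passed in so the function can be exercised without AWS access.

```python
def describe_batch_failure(batch_client, logs_client, job_id,
                           log_group="/aws/batch/job", limit=50):
    """Return (status_reason, container_reason, log_lines) for one AWS Batch job.

    log_group defaults to the usual AWS Batch log group; this is an
    assumption about the setup, not something confirmed in this thread.
    """
    jobs = batch_client.describe_jobs(jobs=[job_id])["jobs"]
    if not jobs:
        raise LookupError(f"Batch job {job_id!r} not found")
    job = jobs[0]
    container = job.get("container", {})
    stream = container.get("logStreamName")

    lines = []
    if stream:
        # Fetch the most recent events of the container's log stream.
        events = logs_client.get_log_events(
            logGroupName=log_group,
            logStreamName=stream,
            limit=limit,
            startFromHead=False,
        )["events"]
        lines = [event["message"] for event in events]
    return job.get("statusReason"), container.get("reason"), lines


# Example usage (requires AWS credentials; the job ID is the bracketed ID
# that Metaflow prints next to "Task is starting"):
#
#   import boto3
#   reason, container_reason, log_lines = describe_batch_failure(
#       boto3.client("batch"), boto3.client("logs"),
#       "bf7f4021-757d-4d11-bf01-553f2637aff1",
#   )
#   print(reason, container_reason, *log_lines, sep="\n")
```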

My problem is solved, but I'm keeping this issue open in case you want to add an informative error message for the edge case.

Thank you for building and maintaining such an amazing framework!

@shchur shchur closed this as completed Jan 10, 2024
@shchur shchur changed the title HelloFlow fails on AWS Batch with Data store error Metaflow crashes on AWS Batch if folder called metaflow is present in the working directory Jan 10, 2024
@shchur shchur reopened this Jan 10, 2024
@madhur-ob
Collaborator

@shchur I wasn't able to reproduce this, can you please share your directory structure?

Mine looks like the following:

Screenshot 2024-01-23 at 6 28 08 PM

and I submit the flow with python hello.py run --with batch

@shchur
Author

shchur commented Jan 25, 2024

@madhur-ob The problem was that I built my Docker image used by Metaflow in the same directory. In other words, if the metaflow folder is present in the working directory of the Docker image, the Batch job crashes because it cannot unpack the archive.
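
A check along these lines could be run inside the image (for example as a build step) to catch the conflict early. This is a minimal sketch, not something Metaflow provides: `find_name_conflicts` is a hypothetical helper, and `metaflow` is the only conflicting name confirmed in this thread.

```python
import os


def find_name_conflicts(workdir, reserved=("metaflow",)):
    """Return the reserved names that already exist as entries in workdir.

    `reserved` lists names assumed to be created during job setup; only
    "metaflow" is confirmed by the mkdir error reported above.
    """
    return [
        name for name in reserved
        if os.path.lexists(os.path.join(workdir, name))
    ]


if __name__ == "__main__":
    conflicts = find_name_conflicts(os.getcwd())
    if conflicts:
        raise SystemExit(
            "Working directory already contains: " + ", ".join(conflicts)
            + ". Remove or rename these entries, or change the image's "
            "working directory, before running Batch steps."
        )
    print("No conflicting entries found.")
```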
