Data store error #1761

Open
jareenramuk opened this issue Mar 13, 2024 · 4 comments

@jareenramuk

jareenramuk commented Mar 13, 2024

I have been working with Metaflow on AWS Batch, and all of a sudden over the past few days I have been getting
"Data store error: No completed attempts of the task was found for task" at random places in the flow.

Sometimes it pops up in the start step and sometimes in the end step. I have observed this in foreach steps and in FlowSpec's parallel_map. We are using Metaflow version 2.10.7.
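
For context, the flows where this shows up have roughly the shape below. The step names, data, and memory size here are illustrative only, not our actual flow:

```python
# Minimal sketch of a foreach flow on AWS Batch, similar in shape to the
# flows where the error appears. Names and resource sizes are placeholders.
from metaflow import FlowSpec, step, batch

class ForeachDemoFlow(FlowSpec):

    @step
    def start(self):
        # Fan out over a small list of inputs.
        self.items = ["a", "b", "c"]
        self.next(self.process, foreach="items")

    @batch(memory=16000)  # memory in MB; adjust to your workload
    @step
    def process(self):
        # Each foreach branch runs as its own AWS Batch job.
        self.result = self.input.upper()
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = [i.result for i in inputs]
        self.next(self.end)

    @step
    def end(self):
        print(self.results)

if __name__ == "__main__":
    ForeachDemoFlow()
```

Each foreach branch runs as a separate Batch job, and the error appears on seemingly random branches.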

@bjupreti

I'm running multiple jobs in AWS Batch using Metaflow, and recently I too have been noticing sporadic job failures with exit code 137 in AWS Batch.

I understand exit code 137 usually indicates an out-of-memory kill, but we are only seeing this error occasionally.

We tested two jobs with the same payload, giving one 16 GB of memory and the other 128 GB. The 16 GB job passed once while the 128 GB job failed, so we are not sure it's actually a memory issue.

Is there any chance this is a Metaflow-related issue? The error we are seeing is:

Data store error: 
No completed attempts of the task was found for task

I checked the task_datastore.py file of this repository and noticed this error is thrown if the 'Done.lock' file is not created.
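
In rough shape, the check I'm describing works like this. This is a simplified paraphrase for illustration, not Metaflow's actual code:

```python
# Simplified paraphrase of the check described above -- NOT Metaflow's
# actual implementation. The datastore looks for a done marker written at
# the end of a successful attempt and raises if no attempt ever wrote one.

class DataException(Exception):
    pass

def latest_completed_attempt(existing_paths, task_path, max_attempts=4):
    """Return the newest attempt whose 'Done.lock'-style marker exists."""
    for attempt in reversed(range(max_attempts)):
        if f"{task_path}/{attempt}/Done.lock" in existing_paths:
            return attempt
    # No attempt ever wrote its done marker, e.g. because the container
    # was killed (exit 137) before the task finished writing it.
    raise DataException(
        f"No completed attempts of the task was found for task {task_path}"
    )

# Example: attempt 0 died before writing the marker, attempt 1 succeeded.
paths = {"MyFlow/1/process/3/1/Done.lock"}
print(latest_completed_attempt(paths, "MyFlow/1/process/3"))  # -> 1
```

So anything that kills the container before the marker is written would surface downstream as this datastore error.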

We tested this with various versions of Metaflow, including 2.9.11 and 2.10.7, and we are seeing the same error on all of them.

@savingoyal
Collaborator

@bjupreti are you able to replicate this error consistently?

@bjupreti

No, I'm not able to replicate it consistently. The same Batch job, with the same compute environment, resources, and payload, sometimes passes and sometimes fails.

@bjupreti

In my case, I checked the logs of the EC2 instances. As part of EC2 startup, SSM scripts were running that restarted the Docker daemon, which stopped the running job container. The ECS agent then came back up, saw that the container had stopped, and reported to AWS that the job had failed. Once the SSM scripts were no longer run on the instances, the ECS agent did not restart, and I'm no longer seeing the above error.
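
For anyone else debugging this, one way to tell an external container kill apart from a genuine OOM is to look at the Batch job's exit code and status reason, for example with boto3. The job ID below is a placeholder:

```python
# Hypothetical diagnostic: inspect an AWS Batch job's exit code and status
# reason to distinguish an OOM kill from an externally stopped container.
import boto3

batch = boto3.client("batch")
resp = batch.describe_jobs(jobs=["<your-job-id>"])

for job in resp["jobs"]:
    container = job.get("container", {})
    print("job:", job["jobName"])
    print("status:", job["status"], "-", job.get("statusReason"))
    print("exit code:", container.get("exitCode"))
    print("container reason:", container.get("reason"))
```

A true OOM kill usually shows a container reason like "OutOfMemoryError", whereas a daemon restart stops the container without one.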
