Data store error #1761
I'm running multiple jobs on AWS Batch using Metaflow, and recently I have also been noticing sporadic job failures with exit code 137. I understand that exit code 137 usually indicates a memory issue, but we only see this error occasionally. We tested by running two jobs with the same payload, one with 16 GB of memory and one with 128 GB. The 16 GB job passed once while the 128 GB job failed, so we are not sure it is actually a memory issue. Is there any chance that this is a Metaflow-related issue? The error we are seeing is:
I checked the task_datastore.py file in this repository and noticed that this error is thrown if the 'Done.lock' file is not created. We tested this with various versions of Metaflow, including 2.9.11 and 2.10.7, and we see the same error on all of them.
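A side note on the exit code mentioned above: by the usual POSIX shell and Docker convention, an exit code above 128 means the process was terminated by signal number (code - 128), so 137 is SIGKILL (9). An out-of-memory kill is one common source of SIGKILL, but anything that stops the container forcibly can produce the same code. A minimal sketch of that arithmetic, where `describe_exit_code` is a made-up helper name and not part of Metaflow or AWS:

```python
import signal


def describe_exit_code(code):
    """Explain an exit code using the 128 + signal-number convention.

    Hypothetical helper for illustration; assumes a POSIX system,
    where signal 9 is SIGKILL.
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a named signal on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


print(describe_exit_code(137))  # killed by SIGKILL
```

So exit code 137 alone does not prove the job ran out of memory; it only says the container was killed with SIGKILL.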
@bjupreti are you able to replicate this error consistently?
No, I'm not able to replicate it consistently. The same Batch job, with the same compute environment, resources, and payload, sometimes passes and sometimes fails.
In my case, I checked the logs of the EC2 instances. As part of EC2 startup, SSM scripts were running that restarted the Docker daemon, which stopped the running job container. When the ECS service came back up, it saw that the container had been stopped and informed AWS that the job had failed. Once the SSM scripts were no longer run on the instances, the ECS agent service did not restart, and I'm not seeing the above error anymore.
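For anyone trying to tell these two cases apart, the per-attempt fields in the AWS Batch `DescribeJobs` response (`attempts[].container.exitCode`, `attempts[].container.reason`, `attempts[].statusReason`) may help. Below is a rough sketch that summarizes them. The helper names and the sample job dict are made up for illustration, and matching on the substring "OutOfMemory" is an assumption about how the memory-kill reason is worded, so treat the result as a hint only.

```python
def summarize_attempts(job):
    """Return one summary line per attempt of a Batch job description.

    `job` is one element of the "jobs" list in a DescribeJobs response.
    Hypothetical helper; the "OutOfMemory" substring check is an assumption.
    """
    lines = []
    for index, attempt in enumerate(job.get("attempts", []), start=1):
        container = attempt.get("container", {})
        exit_code = container.get("exitCode")
        reason = container.get("reason", "")
        status_reason = attempt.get("statusReason", "")
        looks_like_oom = "OutOfMemory" in reason
        lines.append(
            f"attempt {index}: exit_code={exit_code} "
            f"oom={looks_like_oom} reason={reason!r} "
            f"status_reason={status_reason!r}"
        )
    return lines


def fetch_job(job_id):
    """Fetch one job description from AWS Batch (needs boto3 and credentials)."""
    import boto3  # imported here so the rest runs without boto3 installed

    response = boto3.client("batch").describe_jobs(jobs=[job_id])
    return response["jobs"][0]


# Made-up sample data, not a real job:
sample_job = {
    "attempts": [
        {"container": {"exitCode": 137, "reason": ""}, "statusReason": ""},
    ]
}
for line in summarize_attempts(sample_job):
    print(line)
```

If an attempt shows exit code 137 with no memory-related reason, it may be worth checking the instance logs for a Docker daemon or ECS agent restart like the one described above.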
I have been working with Metaflow on AWS Batch, and for the past few days I have suddenly been getting
"Data store error: No completed attempts of the task was found for task" at random places in the flow.
Sometimes it pops up in the start step and sometimes in the end step. I have observed this with Foreach and with FlowSpec parallel_map. We are using Metaflow version 2.10.7.