Data store error #1761

Open
jareenramuk opened this issue Mar 13, 2024 · 4 comments

@jareenramuk

jareenramuk commented Mar 13, 2024

I have been working with Metaflow on AWS Batch, and all of a sudden over the past few days I have been getting
"Data store error: No completed attempts of the task was found for task" at random places in the flow.

Sometimes it pops up in the start step and sometimes in the end step. I have observed this in foreach steps and in FlowSpec's parallel_map. We are using Metaflow version 2.10.7.
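
For context, the flows where this shows up have roughly the shape below. The step names, data, and memory size here are illustrative only, not our actual flow:

```python
# Minimal sketch of a foreach flow on AWS Batch, similar in shape to the
# flows where the error appears. Names and resource sizes are placeholders.
from metaflow import FlowSpec, step, batch

class ForeachDemoFlow(FlowSpec):

    @step
    def start(self):
        # Fan out over a small list of inputs.
        self.items = ["a", "b", "c"]
        self.next(self.process, foreach="items")

    @batch(memory=16000)  # memory in MB; adjust to your workload
    @step
    def process(self):
        # Each foreach branch runs as its own AWS Batch job.
        self.result = self.input.upper()
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = [i.result for i in inputs]
        self.next(self.end)

    @step
    def end(self):
        print(self.results)

if __name__ == "__main__":
    ForeachDemoFlow()
```

Each foreach branch runs as a separate Batch job, and the error appears on seemingly random branches.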

@bjupreti

I'm running multiple jobs in AWS Batch using Metaflow, and recently I too have been noticing sporadic job failures with exit code 137 in AWS Batch.

I understand exit code 137 usually indicates an out-of-memory kill, but we are only seeing this error occasionally.

We tested two jobs with the same payload, giving one 16 GB of memory and the other 128 GB. The 16 GB job passed once while the 128 GB job failed, so we are not sure it's actually a memory issue.

Is there any chance this is a Metaflow-related issue? The error we are seeing is:

Data store error: 
No completed attempts of the task was found for task

I checked the task_datastore.py file of this repository and noticed this error is thrown if the 'Done.lock' file is not created.
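
In rough shape, the check I'm describing works like this. This is a simplified paraphrase for illustration, not Metaflow's actual code:

```python
# Simplified paraphrase of the check described above -- NOT Metaflow's
# actual implementation. The datastore looks for a done marker written at
# the end of a successful attempt and raises if no attempt ever wrote one.

class DataException(Exception):
    pass

def latest_completed_attempt(existing_paths, task_path, max_attempts=4):
    """Return the newest attempt whose 'Done.lock'-style marker exists."""
    for attempt in reversed(range(max_attempts)):
        if f"{task_path}/{attempt}/Done.lock" in existing_paths:
            return attempt
    # No attempt ever wrote its done marker, e.g. because the container
    # was killed (exit 137) before the task finished writing it.
    raise DataException(
        f"No completed attempts of the task was found for task {task_path}"
    )

# Example: attempt 0 died before writing the marker, attempt 1 succeeded.
paths = {"MyFlow/1/process/3/1/Done.lock"}
print(latest_completed_attempt(paths, "MyFlow/1/process/3"))  # -> 1
```

So anything that kills the container before the marker is written would surface downstream as this datastore error.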

We tested this with various versions of Metaflow, including 2.9.11 and 2.10.7, and we are seeing the same error on all of them.

@savingoyal
Collaborator

@bjupreti are you able to replicate this error consistently?

@bjupreti

No, I'm not able to replicate it consistently. The same Batch job, with the same compute environment, resources, and payload, sometimes passes and sometimes fails.

@bjupreti

In my case, I checked the logs of the EC2 instances. As part of EC2 startup, SSM scripts were running that restarted the Docker daemon, which stopped the running job container. The ECS agent then came back up, saw that the container had stopped, and reported to AWS that the job had failed. Once the SSM scripts were no longer run on the instances, the ECS agent did not restart, and I'm no longer seeing the above error.
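
For anyone else debugging this, one way to tell an external container kill apart from a genuine OOM is to look at the Batch job's exit code and status reason, for example with boto3. The job ID below is a placeholder:

```python
# Hypothetical diagnostic: inspect an AWS Batch job's exit code and status
# reason to distinguish an OOM kill from an externally stopped container.
import boto3

batch = boto3.client("batch")
resp = batch.describe_jobs(jobs=["<your-job-id>"])

for job in resp["jobs"]:
    container = job.get("container", {})
    print("job:", job["jobName"])
    print("status:", job["status"], "-", job.get("statusReason"))
    print("exit code:", container.get("exitCode"))
    print("container reason:", container.get("reason"))
```

A true OOM kill usually shows a container reason like "OutOfMemoryError", whereas a daemon restart stops the container without one.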
