Argo Workflows template exceeds max size with preceding large foreach #1538

Closed
saikonen opened this issue Sep 14, 2023 · 8 comments · Fixed by #1704

Comments

@saikonen
Collaborator

The task IDs of the foreach steps preceding a join appear multiple times in ARGO_TEMPLATE in the Argo Workflows init container, which bloats its size significantly. With wide foreaches, the template exceeds the maximum size, leading to a broken flow.
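
To illustrate the growth, here is a rough sketch (not Metaflow's actual code) of how the serialized size scales with foreach width when the same list of input paths is embedded several times. The path format, the template layout, and the number of copies are all assumptions chosen for illustration.

```python
import json


def input_paths(n_tasks, run_id="argo-run-123", step="process"):
    # Hypothetical path format "<run_id>/<step>/<task_id>"; the real one may differ.
    return ",".join(f"{run_id}/{step}/t-{i:08d}" for i in range(n_tasks))


def template_size(n_tasks, copies):
    # Embed the same value `copies` times, mimicking the duplication described
    # above, and measure the JSON-serialized size in bytes.
    value = input_paths(n_tasks)
    template = {
        "name": "join",
        "params": [
            {"name": f"input-paths-{c}", "value": value} for c in range(copies)
        ],
    }
    return len(json.dumps(template).encode("utf-8"))


for n in (500, 2000, 4000):
    # Columns: foreach width, size with three copies, size with one copy.
    print(n, template_size(n, copies=3), template_size(n, copies=1))
```

The size grows linearly with the number of tasks, and every extra copy of the value multiplies it again, which is why removing duplicates raises the ceiling without removing it.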

Possibly a regression bug. Some discussion on the initial report here: https://outerbounds-community.slack.com/archives/C02116BBNTU/p1694529680541219

@saikonen saikonen added the bug Something isn't working label Sep 14, 2023
@alexflorezr

Hi, by any chance do you have an update about this one? 🙏

@tslott

tslott commented Dec 1, 2023

What is the status on this issue?

@saikonen saikonen self-assigned this Dec 7, 2023
@saikonen
Collaborator Author

saikonen commented Dec 7, 2023

Sorry for the long delay, looking into a fix for this now.

@saikonen
Collaborator Author

Opened a first attempt at remedying the duplication of the input-paths parameters in ARGO_TEMPLATE. I managed to get rid of the duplication, but this will not solve the core issue, which is that Argo materialises the value of a Parameter into the template environment variable. In the meantime, removing the duplicates should significantly raise the maximum number of foreach splits; previously, flows were failing at joins of ~2k tasks.

As a future improvement that would solve the issue completely, I'm going to look into passing the input-paths through the datastore instead, but this is a bigger overhaul, as it needs to work across cloud providers.
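
A minimal sketch of the datastore idea, assuming a local directory stands in for the datastore. The function names and key layout are hypothetical and are not Metaflow's API; the point is only that a short key, rather than the full list, would travel through the template.

```python
import json
import tempfile
from pathlib import Path


def store_input_paths(datastore_root, run_id, join_step, paths):
    # Write the full list to a (hypothetical) datastore location and return a
    # short key; only this key would need to be passed as a parameter.
    key = f"{run_id}/{join_step}/input_paths.json"
    target = Path(datastore_root) / key
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(paths))
    return key


def load_input_paths(datastore_root, key):
    # The join task resolves the key back into the full list of paths.
    return json.loads((Path(datastore_root) / key).read_text())


with tempfile.TemporaryDirectory() as root:
    paths = [f"argo-run-123/process/t-{i:08d}" for i in range(5000)]
    key = store_input_paths(root, "argo-run-123", "join", paths)
    restored = load_input_paths(root, key)
    # The key length stays constant no matter how wide the foreach is.
    print(len(key), len(restored))
```

A real implementation would have to do the same against S3, GCS, and Azure storage, which is the cross-cloud overhaul mentioned above.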

Alternative solutions and their shortcomings:
I looked into changing the input-paths to be an input Artifact instead of a Parameter. At first this looked promising, especially since Artifacts support having their value set inline as raw data. However, this works the same way as Parameters: the value gets materialised into the template environment variable.

Another option would have been to use a storage backend for the artifacts, for example S3. However, this requires extra configuration on the Argo infrastructure side and complicates the setup unnecessarily. Setting up artifact storage might also not be possible for some deployments, which would completely break existing functionality for them.
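
For the inline raw-data alternative discussed above, the sketch below builds both input shapes as Python dicts and compares their serialized sizes. The field names follow the Argo Workflows template spec as I understand it, and the artifact path and the stand-in value are hypothetical.

```python
import json

# Stand-in for a large input-paths value.
value = ",".join(f"run/step/{i}" for i in range(1000))

# Input passed as a Parameter.
as_parameter = {
    "inputs": {"parameters": [{"name": "input-paths", "value": value}]}
}

# Input passed as an Artifact whose content is set inline as raw data.
as_raw_artifact = {
    "inputs": {
        "artifacts": [
            {
                "name": "input-paths",
                "path": "/tmp/input-paths",  # hypothetical mount path
                "raw": {"data": value},
            }
        ]
    }
}

param_size = len(json.dumps(as_parameter))
artifact_size = len(json.dumps(as_raw_artifact))
# Both serialized forms carry the full value, so both grow with it.
print(len(value), param_size, artifact_size)
```

Either way the full value ends up inside the serialized template, so switching to an inline artifact does not shrink it.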

@tslott

tslott commented Dec 15, 2023

Thx for looking into the issue 👌

> In the meantime, the removal of the duplicates should bump the maximum number of foreach splits significantly, where previously the flows were failing at joins of ~2k tasks

I wonder what the new upper limit is? Just an estimate.

@saikonen
Collaborator Author

The upper limit seems to be between 3500 and 4500 tasks with the changes: 3500 passed but 4500 failed. This is a slight improvement on what was previously supported, but there are some concerns, as it's now reading directly from the ARGO_TEMPLATE environment variable.

  • the approach does not solve the scaling issues completely
  • there are some proposals in the Argo project to offload the environment variable completely, which would break this approach in the future
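
A hedged sketch of what reading the value directly from the environment variable could look like. It assumes ARGO_TEMPLATE holds the JSON-serialized template with resolved `inputs.parameters` entries; the helper name is hypothetical and this is not the code from the fix.

```python
import json
import os


def input_paths_from_env(param_name="input-paths", env_var="ARGO_TEMPLATE"):
    # Assumes the variable holds the JSON-serialized template with resolved
    # inputs.parameters; returns None if the variable or parameter is absent.
    template = json.loads(os.environ.get(env_var, "{}"))
    for param in template.get("inputs", {}).get("parameters", []):
        if param.get("name") == param_name:
            return param.get("value")
    return None


# Simulated environment, for illustration only.
os.environ["ARGO_TEMPLATE"] = json.dumps(
    {
        "name": "join",
        "inputs": {
            "parameters": [
                {"name": "input-paths", "value": "run/step/1,run/step/2"}
            ]
        },
    }
)
print(input_paths_from_env())
```

If Argo stops populating the variable, as the second concern anticipates, a lookup like this would return None, so a fallback would be needed.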

@roofurmston

In case it is of interest, it looks like the original issue in Argo Workflows has now been fixed - argoproj/argo-workflows#12325

@alexflorezr

Hi, are there any plans to release this? I saw that there is a merged PR in Argo which should solve this issue. However, it was not released this time, and I am not sure if they have any plans to release it soon 🤔
