Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Remove overriding executionDeadline with maxDurationDeadline. Fixes #13044 #13049

Closed
wants to merge 1 commit into from

Conversation

leesungbin
Copy link
Contributor

Fixes #13044

Motivation

I've noticed that the maxDurationDeadline calculation process is unusual and this value is overriding executionDeadline.

For example, with the following workflow, it waits for 120 seconds and exits with error code 10:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-sample-
spec:
  activeDeadlineSeconds: 3600
  entrypoint: retry-example
  templates:
    - name: retry-example
      retryStrategy:
        limit: 2
        backoff:
          duration: 10s
          factor: 2
          maxDuration: 3m
      container:
        image: alpine
        command: [sh, -c]
        args: ["sleep 120; exit 10"]

But, when the first attempt fails, the second node's container has a deadline within 1 minute, and it is killed by its wait container.

The latest Argo workflow follows this timeline (not exact times, for easier understanding):

0s: firstChildNode started (deadline = 3600s)
120s: firstChildNode finished and exited with an error
  (deadline = firstChildNode.StartedAt + backoff.maxDuration = 180s)
130s: Waited for 10 seconds(backoff duration) & secondChildNode started (deadline: 50 seconds left)
180s: Deadline exceeded, wait container kills main container

Modifications

  1. The maxDurationDeadline was calculated by adding maxDuration to firstChildNode's startTime.
  • Changed to calculate maxDurationDeadline with lastChildNode's finishedTime.
  1. executionDeadline was overridden by maxDurationDeadline.
  • Removed this override.

Verifications

I've tested with the above workflow, and it is working as expected.

@leesungbin
Copy link
Contributor Author

I'll close this PR (updating documentation about maxDuration will be enough.)

@leesungbin leesungbin closed this May 20, 2024
@agilgur5 agilgur5 added area/controller Controller issues, panics solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate) area/retryStrategy Template-level retryStrategy labels May 20, 2024
@agilgur5
Copy link
Member

Superseded by #13068 docs PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics area/retryStrategy Template-level retryStrategy solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retrying workflows have short deadline ignoring activeDeadlineSeconds
2 participants